Abstract

The present work addresses binary classification by use of the k-nearest neighbors
(kNN) classifier. Among several assets, it belongs to the intuitive majority vote classification
rules and also adapts to spatial inhomogeneity, which is particularly relevant in
high dimensional settings where no a priori partitioning of the space seems realistic.

However the performance of the kNN classifier crucially depends on the number
k of neighbors that are considered. To calibrate the parameter k, cross-validation
procedures such as V-fold or leave-one-out are usually used. But on the one hand these
procedures can become highly time-consuming. On the other hand, few
theoretical guarantees exist on the performance of such procedures. Recently [11]
derived closed-form formulas for the leave-p-out estimator of the kNN classifier
performance. Such formulas now make it possible to perform cross-validation efficiently.

The main purpose of the present article is twofold. First, we provide a new strategy
to derive bounds on moments of the leave-p-out estimator used to assess the performance
of the kNN classifier. This new strategy exploits the link between leave-p-out
and U-statistics as well as the generalized Efron-Stein inequality. Second, these moment
upper bounds are used to settle a new exponential concentration inequality for
the LpO risk estimator, which characterizes its behavior with respect to the influential
parameter k.

1 Introduction 

Cross-validation (CV) refers to a set of widely used procedures introduced and analyzed
by [45, 24, 44] to assess the performance of an estimator and choose the best one among
a given collection, that is to perform model selection (or parameter calibration). For
a given 1 ≤ p ≤ n - 1, all CV procedures rely on splitting a sample of cardinality n
into two disjoint subsets called training and test sets, with respective cardinality n - p
and p. The n - p data in the training set serve to compute an estimator, while its
performance is evaluated from the p left-out data of the test set. This splitting scheme
avoids overly optimistic performance estimates such as the re-substitution error,
which evaluates the performance of an estimator on the data used to compute it. For a
complete and comprehensive review of cross-validation procedures, we refer the interested
reader to [1].

Among CV procedures, two types can be distinguished: exhaustive and non-exhaustive
ones. For 1 ≤ p ≤ n - 1, the leave-p-out cross-validation (LpO) is an instance of exhaustive
procedure since it considers all possible splits of the sample into a training set of cardinality
n - p and a test set of cardinality p. Therefore LpO enjoys a minimal variance property
among CV procedures with a test set of cardinality p. However due to the high induced
computational cost, LpO cannot be computed in general, except for instance if p = 1 where
it reduces to the celebrated leave-one-out (L1O). To avoid this high computational cost,
approximations to the LpO procedure have been proposed, such as the Hold-Out (HO) and
the V-fold cross-validation (V-FCV) (with p = ⌊n/V⌋). V-FCV is less computationally
demanding than LpO but more variable since it depends on a preliminary random partition
of the data into V disjoint subsets [1]. For instance in density estimation, [12] have
quantified the additional variance of V-FCV in comparison to that of LpO. Recently
LpO has received renewed interest since it has been shown that closed-form formulas can be
derived for the LpO estimator in a wide range of settings, which makes LpO attractive
from both a statistical and a computational point of view. For instance such closed-form
formulas have been settled in density estimation [14, 10], in nonparametric regression with
projection or kernel estimators [13], and in change-point detection [2].

Despite the practical success of CV, very little is known about its theoretical
properties in terms of model selection with respect to p. Probably for technical
reasons, existing results are mainly settled for HO and L1O. For instance, the asymptotic
equivalence between L1O and penalized criteria such as AIC or Mallows' Cp is settled by
[35]. In the density estimation framework, [12, 14, 13] describe the behavior of LpO in
terms of bias and variance, while [4, 10] respectively analyze the performance of V-FCV
and LpO for model selection. With a focus on regression, [9] considers V-FCV, for which he
settles the asymptotic dependence of the bias and variance with respect to V, and [40, 41]
characterize the asymptotic behavior of CV for model selection in the linear regression
framework. Note that in a general regression model, [46] recently studied the performance
of CV to recover the best predictor among a finite number of candidates depending on
p. The binary classification framework has also attracted some attention. First [30]
addresses the HO model selection performance under assumptions on the VC dimension
of the models, and then provides sanity-check bounds for the L1O [31]. Second, [47] proves
consistency in selection for some CV procedures for identifying the best classifier among
a finite number of candidates. A noticeable aspect of the CV literature is that, contrary to a
common idea about the way p should be chosen "in practice", numerous settings have been
reported where the ratio p/n → 1 is required as n tends to +∞ to be able to recover the
best model in a collection, for instance. This phenomenon, called "paradox of CV" by [47],
had already been noticed by [40] in linear regression. More recently [46] (regression) and
[10] (density estimation) have made this observation more precise by relating the optimal
value of the ratio p/n to the convergence rate of the best candidate estimator. For more
references, we refer to [1] and also to the recent paper by [49] for ongoing misleading ideas
about CV.

The present work mainly focuses on the popular k-nearest neighbor rule (kNN) introduced
by [22]. It relies on the simple idea that the predicted value at a new point is given
by a majority vote among the k nearest neighbors of that point. Since it automatically
adapts the scaling to low-density regions of the space, the kNN rule is particularly relevant
in high dimensional settings where no preliminary partitioning of the space seems realistic.
However from a theoretical point of view, there is no existing guideline that could help
to choose the influential number k of neighbors in practice. In regression, [33] provide
a bound on the performance of 1NN that has been further generalized to the kNN rule
(k ≥ 1) by [5], where a bagged version of the kNN rule is also analyzed and then applied
to functional data [6]. In the present work, the main focus is given to the binary classification
framework where preliminary theoretical results date back to [17, 16, 25]. More
recently, [37, 34] derived an asymptotic equivalent to the performance of the 1NN classification
rule, further extended to kNN by [42]. Under various distributional assumptions,
[26] derived asymptotic expansions of the risk of the kNN classifier, which relate this risk
to the parameter k and other unknown distributional parameters. By contrast to previous
results, [18, 15] settled finite sample upper bounds on the risk of the kNN classifier under
mild assumptions.

Since these results cannot serve to provide a data-driven choice of k at this stage, this
choice is usually made by HO or V-FCV in practice without any theoretical validation of
the resulting choice [19, 27]. However, [43, 11] have recently derived closed-form formulas
respectively for the bootstrap and the LpO estimator of the performance achieved by the
kNN classification rule. In particular such formulas for the LpO estimator make it possible to improve
on the more variable V-fold estimator (with p = ⌊n/V⌋), while the traditional question
"Which p leads to the best k?" still remains an open problem.

The first step toward an answer is to understand the link between the moments of the
LpO estimator and the parameters p and k. Deriving upper bounds on these moments would
then provide concentration inequalities [8, 7]. Some preliminary results in this direction
are only available for the L1O (p = 1) estimator of the kNN classifier [20, 38, 21].

The connection between the LpO risk estimator and U-statistics is stated in Section 2.
A first general result is provided for order q moments (q ≥ 2) of the LpO estimator, which
are related to moments of the L1O estimator. This result applies to any classifier as
long as the considered quantities remain well defined. Section 3 then specifies the upper
bounds stated by the previous result in the case of the kNN classifier. This leads to the
main Theorem 3.2, which characterizes the behavior of the LpO estimator with respect to
p and k. In particular while the upper bounds increase with 1 ≤ p ≤ n/2 + 1, this is
no longer the case if p > n/2 + 1. Deriving exponential concentration inequalities for
the LpO estimator is the main concern of Section 4. We illustrate the strength of our
strategy based on U-statistics and moment inequalities by first providing concentration
inequalities derived with less sophisticated tools. We then state our main inequality as a
consequence of Theorem 3.2 and highlight the improvements it allows by comparison to
the previous ones. Finally Section 5 briefly collects some new results that are extensions to
LpO of previous ones originally stated for L1O. This section ends with a corollary assessing
the magnitude of the gap between the LpO estimator and the risk of the kNN classifier
with high probability.

2 U-statistics and LpO estimator

2.1 Statistical framework 

We tackle the binary classification problem where the goal is to predict the unknown label
$Y \in \{0,1\}$ of an observation $X \in \mathcal{X} \subset \mathbb{R}^d$. The random variable $(X, Y)$ has an unknown
joint distribution $P_{(X,Y)}$ defined by $P_{(X,Y)}(A) = \mathbb{P}[(X,Y) \in A]$ for any Borelian set $A$ in
$\mathcal{X} \times \{0,1\}$, where $\mathbb{P}$ denotes a reference probability. To this end, one aims at building a
classifier $f : \mathcal{X} \to \{0,1\}$ that predicts $f(x)$ given $x \in \mathcal{X}$ with the best possible classification
error

\[ L(f) = \mathbb{P}\left( f(X) \neq Y \right). \]

The minimizer of the classification error over the set $\mathcal{F}$ of all measurable functions from $\mathcal{X}$ to
$\{0,1\}$ is known to be the Bayes classifier $f^*$ defined for every $x \in \mathcal{X}$ by

\[ f^*(x) = \mathbb{1}_{\{\eta(x) > 1/2\}}, \quad \text{with } \eta(x) = \mathbb{P}(Y = 1 \mid X = x), \tag{2.1} \]

where $\mathbb{1}_U(x)$ denotes the indicator function of the set $U$, which equals 1 if $x \in U$ and
0 otherwise, and $\eta(\cdot)$ is the regression function of $Y$ given $X = \cdot$.
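As a small illustration of (2.1), the Bayes classifier can be sketched in Python when the regression function is known. The logistic regression function used below is a purely hypothetical choice made for the example; it is not taken from the paper, and in practice the regression function is unknown.

```python
import numpy as np

def bayes_classifier(eta, x):
    """Return 1{eta(x) > 1/2} for a known regression function eta."""
    return (np.asarray(eta(x)) > 0.5).astype(int)

def eta_logistic(x):
    # Hypothetical regression function P(Y = 1 | X = x) on the real line.
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

labels = bayes_classifier(eta_logistic, [-2.0, -0.1, 0.0, 0.3, 4.0])
```

Note that the strict inequality in (2.1) sends points where the regression function equals exactly 1/2 to label 0.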

Any classifier $f \in \mathcal{F}$ results from a strategy $\mathcal{A}$, called classification algorithm or classification
rule, applied to a set of training random variables $Z_{1,n} = \{Z_1, \ldots, Z_n\}$, where
$Z_i = (X_i, Y_i)$ for every $1 \le i \le n$. In what follows, let us further define $Z^v = \{Z_i \mid i \in v\}$
for any subset $v \subset \{1, \ldots, n\}$, so that if $v = \{1, \ldots, n\}$, then $Z^v = Z_{1,n}$. When applied
to different samples, any classification rule $\mathcal{A} : \cup_{n \ge 1} \{\mathcal{X} \times \{0,1\}\}^n \to \mathcal{F}$, which maps a
training sample $Z_{1,n}$ onto the corresponding classifier $\mathcal{A}(Z_{1,n}; \cdot) = f \in \mathcal{F}$, leads to different
classifiers. When no confusion is possible, the notation $\mathcal{A}$ will be used as a shortcut
for $\mathcal{A}(Z_{1,n}; \cdot) \in \mathcal{F}$. Many classification rules have been considered in the literature and
it is out of the scope of the present paper to review all of them (see [19, 27] for various
instances). Here we mainly focus on the k-nearest neighbor rule (kNN) initially proposed
by [22] and further studied for instance by [20, 38]. For $1 \le k \le n$, the kNN rule, denoted
by $\mathcal{A}_k$, consists in classifying any new observation $x$ using a majority vote decision rule
based on the labels of the $k$ closest points $X_{(1)}(x), \ldots, X_{(k)}(x)$ to $x$ among the training
sample $X_1, \ldots, X_n$:




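\[ f_k(Z_{1,n}; x) = \begin{cases} 1 & \text{if } \frac{1}{k} \sum_{i \in V_k(x)} Y_i = \frac{1}{k} \sum_{i=1}^{k} Y_{(i)}(x) > \frac{1}{2}, \\ 0 & \text{otherwise}, \end{cases} \tag{2.2} \]

where $V_k(x) = \{1 \le i \le n, \ X_i \in \{X_{(1)}(x), \ldots, X_{(k)}(x)\}\}$ denotes the set of indices of the
$k$ nearest neighbors of $x$ among $X_1, \ldots, X_n$, and $Y_{(i)}(x)$ is the label of the $i$-th neighbor of $x$ for
$1 \le i \le k$.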
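The majority vote rule (2.2) can be sketched in a few lines of Python. This is only an illustrative sketch: the Euclidean distance, the index-based breaking of distance ties, and the convention that a tied vote returns label 0 are assumptions made here, and the function name `knn_predict` is hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k nearest neighbours of x (Euclidean distance).

    Assumptions of this sketch: distance ties are broken by the smallest
    index, and a tied vote (mean label equal to 1/2) returns label 0.
    """
    y_train = np.asarray(y_train)
    X_train = np.asarray(X_train, dtype=float).reshape(len(y_train), -1)
    x = np.asarray(x, dtype=float).reshape(-1)
    dist = np.linalg.norm(X_train - x, axis=1)
    neighbours = np.argsort(dist, kind="stable")[:k]  # indices in V_k(x)
    return int(np.mean(y_train[neighbours]) > 0.5)

X_demo = [[0.0], [1.0], [2.0], [10.0], [11.0]]
y_demo = [0, 0, 1, 1, 1]
pred_low = knn_predict(X_demo, y_demo, [1.2], k=3)   # neighbours 1, 2, 0
pred_high = knn_predict(X_demo, y_demo, [9.0], k=3)  # neighbours 3, 4, 2
```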

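For a given sample $Z_{1,n}$ (respectively for a given sample size $n \ge 1$), the performance
of any classifier $f = f(Z_{1,n}; \cdot)$ is assessed by the classification error $L(f)$ (respectively the
risk $R(f)$) defined by

\[ L(f) = \mathbb{P}\left( f(X) \neq Y \mid Z_{1,n} \right) \quad \text{and} \quad R(f) = \mathbb{E}\left[ \mathbb{P}\left( f(X) \neq Y \mid Z_{1,n} \right) \right]. \]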

In this paper we focus on the estimation of $L(f)$ (and $R(f)$) by use of the Leave-p-Out
(LpO) cross-validation [48, 12]. Let us briefly recall what it consists in. LpO successively
considers all possible splits of $Z_{1,n}$ into a training set of cardinality $n - p$ and a test set
of cardinality $p$. The final LpO estimator is the average (over all these splits) of the
classification error estimated on each test set:

\[ \widehat{R}_p(f) = \binom{n}{p}^{-1} \sum_{e \in \mathcal{E}_{n-p}} \left( \frac{1}{p} \sum_{i \in \bar{e}} \mathbb{1}_{\{ f(Z^e; X_i) \neq Y_i \}} \right), \tag{2.3} \]

where $\mathcal{E}_{n-p}$ denotes the set of indices of all possible training samples of cardinality $n - p$,
for every $e \in \mathcal{E}_{n-p}$, $\bar{e} = \{1, \ldots, n\} \setminus e$, and $f(Z^e; \cdot)$ is the classifier built from $Z^e$. We refer the
reader to [1] for a detailed description of LpO and other cross-validation procedures.
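To make definition (2.3) concrete, here is a minimal brute-force sketch in Python that enumerates every test set of cardinality p. It is not the closed-form computation of [11]; the kNN tie-breaking conventions (index order for distance ties, label 0 for tied votes) and the function names are assumptions of this example.

```python
from itertools import combinations
import numpy as np

def knn_predict(X_train, y_train, x, k):
    # Majority vote among the k nearest neighbours (Euclidean distance).
    dist = np.linalg.norm(X_train - x, axis=1)
    neighbours = np.argsort(dist, kind="stable")[:k]
    return int(np.mean(y_train[neighbours]) > 0.5)

def lpo_brute_force(X, y, k, p):
    """Average, over all splits, of the test error on the p left-out points."""
    y = np.asarray(y)
    n = len(y)
    X = np.asarray(X, dtype=float).reshape(n, -1)
    split_errors = []
    for test in combinations(range(n), p):
        train = [i for i in range(n) if i not in test]
        errs = [knn_predict(X[train], y[train], X[i], k) != y[i] for i in test]
        split_errors.append(np.mean(errs))
    return float(np.mean(split_errors))

# Two well-separated groups: every left-out point is classified correctly.
separated = lpo_brute_force([0, 1, 2, 10, 11, 12], [0, 0, 0, 1, 1, 1], k=1, p=2)
# Alternating labels on a grid: the nearest neighbour always has the wrong label.
alternating = lpo_brute_force([0, 1, 2, 3], [0, 1, 0, 1], k=1, p=1)
```

The loop visits all test sets of size p, so this sketch is only usable for very small n and p.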
However unlike what arises from (2.3), the LpO estimator can be efficiently computed
by use of closed-form formulas with a time complexity linear in p when applied to the
kNN classification rule [11]. This property remains true in other contexts such as density
estimation [12, 10], regression [13, 2], and so on. In particular this property contrasts with
the usual prohibitive computational complexity suffered by LpO due to the high cardinality
of $\mathcal{E}_{n-p}$, which is equal to $\binom{n}{p}$. LpO can therefore be used to assess the performance of any
kNN classifier and to choose the best value of the parameter k.
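To give an idea of the cardinality that a naive evaluation of (2.3) would face, the number of splits can be computed directly. The sample size n = 100 below is an arbitrary value chosen for illustration, and the helper name is hypothetical.

```python
from math import comb

def number_of_lpo_splits(n, p):
    """Number of distinct test sets of cardinality p among n observations."""
    return comb(n, p)

# Number of splits for n = 100 and a few values of p.
splits = {p: number_of_lpo_splits(100, p) for p in (1, 2, 10, 50)}
```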

2.2 U-statistics: General bounds on LpO moments

The purpose of the present section is to describe a general strategy for deriving new
upper bounds on the moments of the LpO estimator of the risk. As a first step of this
strategy, we settle the connection between the LpO risk estimator and U-statistics. Second,
we exploit this connection to derive new upper bounds on the moments of order q ≥ 1 of
the LpO estimator. In particular these upper bounds, which relate moments of the LpO
estimator to those of the L1O estimator, hold true with any classifier.

Let us start by introducing U-statistics and recalling some of their basic properties.
For an extensive review, we refer to the books by [39, 32]. The first step is the definition of a
U-statistic of order $m \in \mathbb{N}^*$ as an average over all m-tuples of distinct indices in $\{1, \ldots, n\}$.

Definition 2.1 ([32]). Let $h : \mathcal{X}^m \to \mathbb{R}$ denote any Borelian function where
$m \ge 1$ is an integer. Let us further assume $h$ is a symmetric function of its arguments.
Then any function $U_n : \mathcal{X}^n \to \mathbb{R}$ such that

\[ U_n(x_1, \ldots, x_n) = U_n(h)(x_1, \ldots, x_n) = \binom{n}{m}^{-1} \sum_{1 \le i_1 < \ldots < i_m \le n} h\left( x_{i_1}, \ldots, x_{i_m} \right), \]

where $m \le n$, is a U-statistic of order $m$ and kernel $h$.

Before clarifying the connection between LpO and U-statistics, let us introduce the
main property of U-statistics our strategy relies on. It consists in representing any
U-statistic as an average, over all permutations, of sums of independent variables.

Proposition 2.1 (Eq. (5.5) in [28]). With the notation of Definition 2.1, let us define
$W : \mathcal{X}^n \to \mathbb{R}$ by

\[ W(x_1, \ldots, x_n) = \frac{1}{r} \sum_{j=1}^{r} h\left( x_{(j-1)m+1}, \ldots, x_{jm} \right), \]

where $r = \lfloor n/m \rfloor$ denotes the integer part of $n/m$. Then

\[ U_n(x_1, \ldots, x_n) = \frac{1}{n!} \sum_{\sigma} W\left( x_{\sigma(1)}, \ldots, x_{\sigma(n)} \right), \]

where $\sum_{\sigma}$ denotes the summation over all permutations $\sigma$ of $\{1, \ldots, n\}$.
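The representation of Proposition 2.1 can be checked numerically on a toy example. The sketch below is only an illustration under assumptions chosen here: the kernel is the symmetric variance kernel of order 2 (which is not a kernel used in the paper) and the data are arbitrary numbers; all function names are hypothetical.

```python
from itertools import combinations, permutations
from math import comb, factorial

def u_statistic(xs, h, m):
    """Average of the kernel h over all subsets of m distinct indices."""
    n = len(xs)
    total = sum(h(*(xs[i] for i in c)) for c in combinations(range(n), m))
    return total / comb(n, m)

def block_average(xs, h, m):
    """W: average of h over the r = floor(n/m) consecutive disjoint blocks."""
    r = len(xs) // m
    return sum(h(*xs[j * m:(j + 1) * m]) for j in range(r)) / r

def permutation_average(xs, h, m):
    """Average of W over all n! permutations of the sample."""
    n = len(xs)
    total = sum(block_average([xs[i] for i in s], h, m)
                for s in permutations(range(n)))
    return total / factorial(n)

def variance_kernel(a, b):
    # Symmetric kernel of order 2 whose U-statistic is the unbiased variance.
    return 0.5 * (a - b) ** 2

data = [1.0, 2.0, 4.0, 7.0, 11.0]
u_value = u_statistic(data, variance_kernel, 2)
perm_value = permutation_average(data, variance_kernel, 2)
```

Both quantities coincide, and here they equal the unbiased sample variance of the data.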
We are now in position to state the key remark of the paper. All the developments
presented in the remainder of the paper result from this connection between the LpO
estimator defined by Eq. (2.3) and U-statistics.

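Theorem 2.1. For any classification rule $\mathcal{A}$ (leading to the classifier $\mathcal{A}(Z_{1,n}; \cdot) = f$) and
any $1 \le p \le n - 1$ such that the following quantities are well defined, the LpO estimator
$\widehat{R}_p(f)$ is a U-statistic of order $m = n - p + 1$ with kernel $h_m : (\mathcal{X} \times \{0,1\})^m \to \mathbb{R}$ defined by

\[ h_m(Z_1, \ldots, Z_m) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}_{\left\{ f\left( Z_{1,m}^{(i)}; X_i \right) \neq Y_i \right\}}, \]

where $Z_{1,m}^{(i)}$ denotes the sample $(Z_1, \ldots, Z_m)$ with $Z_i$ withdrawn.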

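Proof of Theorem 2.1.

In what follows, let $\mathcal{E}_t$ denote the set of all possible training sample indices $e$ of cardinality
$t \ge 1$. Let us start from Eq. (2.3) and notice that for every $e \in \mathcal{E}_{n-p}$ and $i \in \bar{e}$, $f(Z^e; X_i) =
f^{(i)}(Z^e \cup Z_i; X_i)$, where $f^{(i)}(Z^e \cup Z_i; \cdot)$ is the classifier computed from $\{Z^e \cup Z_i\} \setminus Z_i$.
Then, it comes

\begin{align*}
\widehat{R}_p(f) &= \binom{n}{p}^{-1} \sum_{e \in \mathcal{E}_{n-p}} \frac{1}{p} \sum_{i \in \bar{e}} \mathbb{1}_{\{ f^{(i)}(Z^e \cup Z_i; X_i) \neq Y_i \}} \\
&= \frac{1}{\binom{n}{n-p+1}(n-p+1)} \sum_{e' \in \mathcal{E}_{n-p+1}} \sum_{i \in e'} \mathbb{1}_{\{ f^{(i)}(Z^{e'}; X_i) \neq Y_i \}} \\
&= \binom{n}{m}^{-1} \sum_{1 \le i_1 < \ldots < i_m \le n} h_m\left( Z_{i_1}, \ldots, Z_{i_m} \right),
\end{align*}

where the second equality uses $p \binom{n}{p} = (n-p+1) \binom{n}{n-p+1}$, and with

\[ h_m\left( Z_{i_1}, \ldots, Z_{i_m} \right) = \frac{1}{m} \sum_{i \in e'} \mathbb{1}_{\{ f^{(i)}(Z^{e'}; X_i) \neq Y_i \}}, \qquad e' = \{i_1, \ldots, i_m\}. \]

□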


The kernel $h_m$ is a deterministic and symmetric function of its arguments that
only depends on $m$. Let us also notice that $h_m(Z_1, \ldots, Z_m)$ reduces to the L1O estimator
of the risk of the classification rule $\mathcal{A}$ computed from $Z_1, \ldots, Z_m$, that is

\[ h_m(Z_1, \ldots, Z_m) = \widehat{R}_1\left( \mathcal{A}(Z_{1,m}; \cdot) \right). \tag{2.4} \]

In the context of testing whether two binary classifiers have different error rates, this fact
has already been pointed out by [23].
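The identity behind Theorem 2.1 and Eq. (2.4) can be verified numerically on a small sample: the LpO estimator coincides with the average of the L1O estimator over all subsamples of size m = n - p + 1. The sketch below uses a kNN rule with conventions assumed for this example (Euclidean distance, index order for distance ties, label 0 on tied votes) and synthetic data drawn with an arbitrary seed; all function names are hypothetical.

```python
from itertools import combinations
from math import comb
import numpy as np

def knn_predict(X_train, y_train, x, k):
    dist = np.linalg.norm(X_train - x, axis=1)
    neighbours = np.argsort(dist, kind="stable")[:k]
    return int(np.mean(y_train[neighbours]) > 0.5)

def l1o_kernel(X, y, idx, k):
    """Kernel h_m: leave-one-out error of kNN on the subsample indexed by idx."""
    idx = list(idx)
    errors = 0
    for i in idx:
        rest = [j for j in idx if j != i]
        errors += int(knn_predict(X[rest], y[rest], X[i], k) != y[i])
    return errors / len(idx)

def lpo_direct(X, y, k, p):
    """LpO estimator computed from its definition (all test sets of size p)."""
    n = len(y)
    total = 0.0
    for test in combinations(range(n), p):
        train = [j for j in range(n) if j not in test]
        total += sum(int(knn_predict(X[train], y[train], X[i], k) != y[i])
                     for i in test) / p
    return total / comb(n, p)

def lpo_as_u_statistic(X, y, k, p):
    """Average of the L1O kernel over all subsamples of size m = n - p + 1."""
    n = len(y)
    m = n - p + 1
    return sum(l1o_kernel(X, y, c, k)
               for c in combinations(range(n), m)) / comb(n, m)

rng = np.random.default_rng(0)
n, p, k = 7, 3, 3
X = rng.normal(size=(n, 2))
y = rng.integers(0, 2, size=n)
direct = lpo_direct(X, y, k, p)
via_u = lpo_as_u_statistic(X, y, k, p)
```

The two values agree up to floating-point rounding, whatever the sample, since each pair (training set, left-out point) appears with the same weight in both sums.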

We now derive a general upper bound on the q-th moment (q ≥ 1) of the LpO estimator
that holds true for any classification rule. This results from the combination of
Proposition 2.1 and Theorem 2.1.


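Theorem 2.2. For any classification rule $\mathcal{A}$, let $g_n = \mathcal{A}(Z_{1,n}; \cdot)$ and $g_m = \mathcal{A}(Z_{1,m}; \cdot)$
be the corresponding classifiers built from respectively $Z_1, \ldots, Z_n$ and $Z_1, \ldots, Z_m$, where
$m = n - p + 1$. Then for every $1 \le p \le n - 1$ such that the following quantities are well
defined, and any $q \ge 1$,

\[ \mathbb{E}\left[ \left| \widehat{R}_p(g_n) - \mathbb{E}\left[ \widehat{R}_p(g_n) \right] \right|^q \right] \le \mathbb{E}\left[ \left| \widehat{R}_1(g_m) - \mathbb{E}\left[ \widehat{R}_1(g_m) \right] \right|^q \right]. \tag{2.5} \]

Furthermore as long as $p > n/2 + 1$, one also gets

• for $q = 2$,

\[ \mathbb{E}\left[ \left( \widehat{R}_p(g_n) - \mathbb{E}\left[ \widehat{R}_p(g_n) \right] \right)^2 \right] \le \left\lfloor \frac{n}{m} \right\rfloor^{-1} \mathbb{E}\left[ \left( \widehat{R}_1(g_m) - \mathbb{E}\left[ \widehat{R}_1(g_m) \right] \right)^2 \right], \tag{2.6} \]

• for every $q > 2$,

\[ \mathbb{E}\left[ \left| \widehat{R}_p(g_n) - \mathbb{E}\left[ \widehat{R}_p(g_n) \right] \right|^q \right] \le B(q, \gamma) \max\left\{ 2^q \gamma \left\lfloor \frac{n}{m} \right\rfloor \mathbb{E}\left[ \left| \frac{\widehat{R}_1(g_m) - \mathbb{E}\left[ \widehat{R}_1(g_m) \right]}{\lfloor n/m \rfloor} \right|^q \right], \left( \sqrt{\frac{2 \operatorname{Var}\left( \widehat{R}_1(g_m) \right)}{\lfloor n/m \rfloor}} \right)^q \right\}, \tag{2.7} \]

where $\gamma > 0$ is a numeric constant and $B(q, \gamma)$ denotes the optimal constant defined
in the Rosenthal inequality (Proposition D.2).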


The proof is given in Appendix A.1. Theorem 2.2 relates the upper bounds on the moments
of the LpO estimator to those of the L1O estimator. Eq. (2.6) and (2.7) emphasize
different convergence rates for the moments of the LpO estimator that can be achieved in
the particular setting where p > n/2 + 1. This point will be further discussed in Remark 2
(following Theorem 3.2) and illustrated by Proposition 4.2.


3 New bounds on LpO moments for the kNN classifier

Our goal is now to specify the general upper bounds provided by Theorem 2.2 in the case
of the kNN classification rule $\mathcal{A}_k$ ($1 \le k \le n$) introduced by (2.2).

Since Theorem 2.2 expresses moments of the LpO estimator in terms of those of the
L1O estimator, the next step consists in focusing on the L1O moments. Tight
upper bounds on the moments of the L1O are derived by use of a generalization of the well-known
Efron-Stein inequality (see Theorem D.1 for Efron-Stein's inequality and Theorem 15.5 in
[7] for its generalization). For the sake of completeness, we first recall a corollary of this
generalization that is proved in Section D.1.5 (see Corollary D.1).
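As a reminder of what the classical Efron-Stein inequality states, namely Var(Z) ≤ (1/2) Σ_i E[(Z - Z_i')²] where Z_i' is obtained by replacing the i-th coordinate with an independent copy, the following sketch checks it by exact enumeration. The Bernoulli coordinates and the two test functions (maximum and sum) are assumptions chosen for this illustration only; the generalized moment version used in the paper is not reproduced here.

```python
from itertools import product

def efron_stein_check(f, n, prob_one):
    """Exact Var(Z) and Efron-Stein bound for Z = f(X_1, ..., X_n),
    with X_i i.i.d. Bernoulli(prob_one), computed by full enumeration."""
    def weight(bits):
        w = 1.0
        for b in bits:
            w *= prob_one if b else 1.0 - prob_one
        return w

    points = list(product((0, 1), repeat=n))
    mean = sum(weight(x) * f(x) for x in points)
    var = sum(weight(x) * (f(x) - mean) ** 2 for x in points)

    bound = 0.0
    for x in points:
        for i in range(n):
            for new in (0, 1):  # value taken by the independent copy X_i'
                x_new = x[:i] + (new,) + x[i + 1:]
                w_new = prob_one if new else 1.0 - prob_one
                bound += 0.5 * weight(x) * w_new * (f(x) - f(x_new)) ** 2
    return var, bound

var_max, bound_max = efron_stein_check(max, 4, 0.3)
var_sum, bound_sum = efron_stein_check(sum, 4, 0.3)
```

For a sum of independent variables the inequality is an equality, which the second call illustrates; for the maximum the bound is strict in general.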

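Proposition 3.1. Let $X_1, \ldots, X_n$ denote $n$ independent random variables and $Z = f(X_1, \ldots, X_n)$,
where $f : \mathbb{R}^n \to \mathbb{R}$ is any Borelian function. With $Z_i' = f(X_1, \ldots, X_i', \ldots, X_n)$, where
$X_1', \ldots, X_n'$ are independent copies of the $X_i$s, there exists a universal constant $\kappa \le 1.271$
such that for any $q \ge 2$,

\[ \left\| Z - \mathbb{E} Z \right\|_q \le \sqrt{2 \kappa q} \sqrt{ \left\| \sum_{i=1}^{n} \left( Z - Z_i' \right)^2 \right\|_{q/2} }. \]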


Then applying Proposition 3.1 to $Z = \widehat{R}_1(\mathcal{A}_k(Z_{1,m}; \cdot))$ provides the following
Theorem 3.1, which specifies the upper bound on the q-th moment of the L1O estimator. Its
proof is detailed in Section A.2.

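Theorem 3.1. For every $1 \le k \le n - 1$, let $f_{k,m} = \mathcal{A}_k(Z_{1,m}; \cdot)$ denote the kNN classifier
learnt from $Z_{1,m}$ and $\widehat{R}_1(f_{k,m})$ be the corresponding L1O risk estimator (see (2.3)), with $m =
n - p + 1$. Then

• for $q = 2$,

\[ \mathbb{E}\left[ \left( \widehat{R}_1(f_{k,m}) - \mathbb{E}\left[ \widehat{R}_1(f_{k,m}) \right] \right)^2 \right] \le C_1 \frac{k \sqrt{k}}{m}, \]

• for every $q > 2$,

\[ \mathbb{E}\left[ \left| \widehat{R}_1(f_{k,m}) - \mathbb{E}\left[ \widehat{R}_1(f_{k,m}) \right] \right|^q \right] \le \left( C_2 k \right)^q \left( \frac{q}{m} \right)^{q/2}, \]

with $C_1 = 2 + 16 \gamma_d$ and $C_2 = 4 \gamma_d \sqrt{2 \kappa}$, where $\gamma_d$ denotes the constant arising from Stone's
lemma (Lemma D.6) and $\kappa$ is defined in Proposition 3.1.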

We are now in position to state the main result of this section. It follows from the 
combination of Theorem 2.2 and Theorem 3.1 in a straightforward way. 

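Theorem 3.2. For every $p, k \ge 1$ such that $p + k \le n$, let $\widehat{R}_p(f_k)$ denote the LpO risk
estimator (see (2.3)) of the kNN classifier $f_k = \mathcal{A}_k(Z_{1,n}; \cdot)$ defined by (2.2). Then there
exist (known) constants $C_1, C_2 > 0$ such that for every $1 \le p \le n - k$,

• for $q = 2$,

\[ \mathbb{E}\left[ \left( \widehat{R}_p(f_k) - \mathbb{E}\left[ \widehat{R}_p(f_k) \right] \right)^2 \right] \le C_1 \frac{k \sqrt{k}}{n - p + 1}, \tag{3.1} \]

• for every $q > 2$,

\[ \mathbb{E}\left[ \left| \widehat{R}_p(f_k) - \mathbb{E}\left[ \widehat{R}_p(f_k) \right] \right|^q \right] \le \left( C_2 k \right)^q \left( \frac{q}{n - p + 1} \right)^{q/2}, \tag{3.2} \]

with $C_1 = 2 + 16 \gamma_d$ and $C_2 = 4 \gamma_d \sqrt{2 \kappa}$, where $\gamma_d$ denotes the constant arising from Stone's
lemma (Lemma D.6). Furthermore in the particular setting where $n/2 + 1 < p \le n - k$,
then

• for $q = 2$,

\[ \mathbb{E}\left[ \left( \widehat{R}_p(f_k) - \mathbb{E}\left[ \widehat{R}_p(f_k) \right] \right)^2 \right] \le C_1 \frac{k \sqrt{k}}{(n - p + 1) \left\lfloor \frac{n}{n - p + 1} \right\rfloor}, \tag{3.3} \]

• for every $q > 2$,

\[ \mathbb{E}\left[ \left| \widehat{R}_p(f_k) - \mathbb{E}\left[ \widehat{R}_p(f_k) \right] \right|^q \right] \le \left\lfloor \frac{n}{n - p + 1} \right\rfloor \Gamma^q \max\left\{ \left( \frac{k \sqrt{k}}{(n - p + 1) \left\lfloor \frac{n}{n - p + 1} \right\rfloor} \, q \right)^{q/2}, \left( \frac{k^2}{(n - p + 1) \left\lfloor \frac{n}{n - p + 1} \right\rfloor^2} \, q^3 \right)^{q/2} \right\}, \tag{3.4} \]

where $\Gamma = 2 \sqrt{2e} \max\left( \sqrt{2 C_1}, 2 C_2 \right)$.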

The proof is detailed in Section A.3. 

Remark 1. Eq. (3.4) results from the version of Rosenthal's inequality derived for symmetric
random variables by [29]. In this inequality the optimal constants depend on a
parameter $\gamma > 0$ to be tuned. It has been calibrated to provide tight upper bounds in our
setting (see Propositions D.2 and D.3). Note that the dependence of (3.4) with respect to
q has been proved to be non-improvable in [36].

Remark 2. Eq. (3.3) and (3.4) focus on the setting where $n/2 + 1 < p \le n - k$. To
stress the interest of these bounds, let us consider the case where $k$ and $n - p$ are kept
fixed as $n$ increases. In this context Eq. (3.1) and (3.2) provide non-informative upper
bounds, whereas Eq. (3.3) and (3.4) lead to respective convergence rates of at worst $1/n$ and
$(1/n)^{q/2 - 1}$, for $q > 2$. The previous example is a particular instance of the more general
setting where $p/n \to 1$ as $n$ tends to $+\infty$, which has been investigated in various contexts
by [40, 47, 46, 10].


4 Exponential concentration inequalities 


In this section, we provide exponential concentration inequalities for the LpO estimator of
the risk when the kNN classification rule is used. The main inequalities we provide at the
end of this section heavily rely on the moment inequalities previously derived in Section 3
(namely Theorem 3.2). In order to highlight the interest of our approach based on moment
inequalities, we start this section by stating two exponential inequalities obtained with less
sophisticated tools. For each of them, we discuss its strengths and weaknesses to justify the
refinements we further explore step by step.

A first exponential concentration inequality for $\widehat{R}_p(f_k)$ can be derived by use of the
bounded difference inequality, following the line of proof of [19] originally developed for
the L1O estimator. Its proof is given in Appendix B.1.

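Proposition 4.1. For any integers $p, k \ge 1$ such that $p + k \le n$, let $\widehat{R}_p(f_k)$ denote the LpO
estimator (2.3) of the classification error of the kNN classifier $f_k = \mathcal{A}_k(Z_{1,n}; \cdot)$ defined by
(2.2). Then for every $t > 0$,

\[ \mathbb{P}\left( \left| \widehat{R}_p(f_k) - \mathbb{E}\left[ \widehat{R}_p(f_k) \right] \right| > t \right) \le 2 \exp\left( - \frac{n t^2}{8 (k + p - 1)^2 \gamma_d^2} \right), \tag{4.1} \]

where $\gamma_d$ denotes the constant introduced in Stone's lemma (Lemma D.6).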

The upper bound (4.1) strongly relies on the facts that: (i)
for $X_j$ to be one of the $k$ nearest neighbors of $X_i$ in at least one subsample $X^e$, it requires $X_j$ to
be one of the $k + p - 1$ nearest neighbors of $X_i$ in the complete sample, and (ii) the number of
points for which $X_j$ may be one of the $k + p - 1$ nearest neighbors is bounded by Lemma D.6. This
reasoning results in a rough upper bound since one does not distinguish between points for
which $X_j$ is among the $k$ first neighbors or beyond the $k$-th one, whereas these are strongly
different situations in practice. Consequently the dependence of the convergence rate on
$k$ and $p$ in Proposition 4.1 is not optimal, as confirmed by Theorems 4.1 and 4.2.

Based on the previous comments, a sharper quantification of the influence of each
neighbor among the $k + p - 1$ ones of a given point in the complete sample leads to the
next result.
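Theorem 4.1. For every $p, k \ge 1$ such that $p + k \le n$, let $\widehat{R}_p(f_k)$ denote the LpO
estimator (2.3) of the classification error of the kNN classifier $f_k = \mathcal{A}_k(Z_{1,n}; \cdot)$ defined by
(2.2). Then there exists a numeric constant $\Delta > 0$ such that for every $t > 0$,

\[ \max\left( \mathbb{P}\left( \widehat{R}_p(f_k) - \mathbb{E}\left[ \widehat{R}_p(f_k) \right] > t \right), \ \mathbb{P}\left( \mathbb{E}\left[ \widehat{R}_p(f_k) \right] - \widehat{R}_p(f_k) > t \right) \right) \le \exp\left( - \frac{n t^2}{\Delta k^2 \left[ 1 + (k + p) \frac{p - 1}{n - 1} \right]} \right), \]

with $\Delta = 1024 e \kappa (1 + \gamma_d)$, where $\gamma_d$ is introduced in Lemma D.6 and $\kappa \le 1.271$ is a universal
constant.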


The proof is given in Section B.2. Note that with $p = 1$ one recovers (up to constants)
the bound derived by [19] for L1O. Compared with Proposition 4.1, we still have a $k^2$ in
the denominator but also a $(k + p) \times (p - 1)/(n - 1)$ term. The latter vanishes as long as
$(k \vee p) p / n = o(1)$, which makes this bound tighter than the previous one.

However the upper bound of Theorem 4.1 does not reflect the same dependencies with
respect to $k$ and $p$ as those proved for polynomial moments in Theorem 3.2.
In particular, we do not observe the concentration improvement allowed with high order
moments when $p$ is chosen large enough, with $p > n/2 + 1$ (see also the remarks following
Theorem 3.2). This drawback is overcome by the following upper bounds.

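Theorem 4.2. For every $p, k \ge 1$ such that $p + k \le n$, let $\widehat{R}_p(f_k)$ denote the LpO estimator
of the classification error of the kNN classifier $f_k = \mathcal{A}_k(Z_{1,n}; \cdot)$ defined by (2.2). Then for
every $t > 0$,

\[ \mathbb{P}\left( \widehat{R}_p(f_k) - \mathbb{E}\left[ \widehat{R}_p(f_k) \right] > t \right) \vee \mathbb{P}\left( \mathbb{E}\left[ \widehat{R}_p(f_k) \right] - \widehat{R}_p(f_k) > t \right) \le \exp\left( - (n - p + 1) \frac{t^2}{\Delta^2 k^2} \right), \tag{4.2} \]

where $\Delta = 4 \sqrt{e} \max\left( C_2, \sqrt{C_1} \right)$ with $C_1, C_2 > 0$ defined in Theorem 3.1.
Furthermore in the particular setting where $p > n/2 + 1$, it comes

\[ \mathbb{P}\left( \widehat{R}_p(f_k) - \mathbb{E}\left[ \widehat{R}_p(f_k) \right] > t \right) \vee \mathbb{P}\left( \mathbb{E}\left[ \widehat{R}_p(f_k) \right] - \widehat{R}_p(f_k) > t \right) \le e \left\lfloor \frac{n}{n - p + 1} \right\rfloor \exp\left[ - \frac{1}{2e} \min\left\{ (n - p + 1) \left\lfloor \frac{n}{n - p + 1} \right\rfloor \frac{t^2}{4 \Gamma^2 k \sqrt{k}}, \ \left( (n - p + 1) \left\lfloor \frac{n}{n - p + 1} \right\rfloor^2 \frac{t^2}{4 \Gamma^2 k^2} \right)^{1/3} \right\} \right], \tag{4.3} \]

where $\Gamma$ arises in Eq. (3.4) and $\gamma_d$ denotes the constant introduced in Stone's lemma
(Lemma D.6).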
The proof has been postponed to Appendix B.3. It is based on the combination
of Theorem 3.2 with Lemma D.3 and Proposition D.1, respectively, to derive Eq. (4.2)
and (4.3). Note that Theorem 4.2 provides a strict improvement upon Theorem 4.1.
Unlike the upper bound provided in the latter theorem, Eq. (4.2) remains meaningful as
long as $p/n \to \delta \in [0, 1[$ as $n$ tends to $+\infty$, while (4.3) deals with the case where $\delta = 1$.

In order to allow a better interpretation of the last inequality (4.3), we also provide
the following proposition (proved in Appendix B.3), which focuses on the description of
each deviation term in the particular case where $p > n/2 + 1$.

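Proposition 4.2. For any $p, k \ge 1$ such that $p + k \le n$, $p > n/2 + 1$, and for every $t > 0$,

\[ \mathbb{P}\left[ \widehat{R}_p(f_k) - \mathbb{E}\left[ \widehat{R}_p(f_k) \right] > \Gamma \left( \sqrt{2e} \sqrt{ \frac{k \sqrt{k}}{(n - p + 1) \left\lfloor \frac{n}{n - p + 1} \right\rfloor} \, t } + 2e \sqrt{ \frac{k^2}{(n - p + 1) \left\lfloor \frac{n}{n - p + 1} \right\rfloor^2} } \, t^{3/2} \right) \right] \le \left\lfloor \frac{n}{n - p + 1} \right\rfloor e \cdot e^{-t}, \]

where $\Gamma > 0$ is the constant arising from (3.4).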

Let us notice that for every fixed $k$, both deviation terms are of order $1/\sqrt{n}$ as long as
$p/n \to \delta \in [0, 1[$ since


(n-p+l) 


n 


n-p + l 


= n(l + o(l)) = (n - p + 1) 


n 


n-p+l 


as $n$ tends to $+\infty$. However if $p/n \to 1$ as $n$ tends to $+\infty$, by setting $\eta = 1 - p/n = o(1)$,
it results that the first deviation term remains of order $1/\sqrt{n}$ while the order of the second one
becomes $\sqrt{\eta/n} = o(1/\sqrt{n})$ as $n$ tends to $+\infty$. Note also that the dependence of the first
(sub-Gaussian) deviation term with respect to $k$ is only $k \sqrt{k}$, which improves upon the
$k^2$ provided by usual results such as Ineq. (4.2) in Theorem 4.2.


5 Assessing the gap between LpO and prediction error

In the present section, we derive new upper bounds on different measures of the discrepancy
between $\widehat{R}_p(f_k)$ and $L(f_k)$. These bounds on the LpO estimator are completely new for
$1 < p \le n - 1$. Some of them are extensions of former ones specifically derived for the
L1O risk estimator of the kNN classifier.

Following the same line of proof as Theorem 2.1 in [38], originally developed for the
L1O estimator, we were able to upper bound the mean of the (squared) difference between
$\widehat{R}_p(f_k)$ and $L(f_k)$. These bounds describe how this error depends on
influential quantities such as $p$, $k$, and $n$.
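Theorem 5.1. For every $p, k \ge 1$ such that $p + k \le n$, let $\widehat{R}_p(f_k)$ denote the LpO risk
estimator (see (2.3)) of the kNN classifier $f_k$ defined by (2.2). Then,

\[ \left| \mathbb{E}\left[ \widehat{R}_p(f_k) - L(f_k) \right] \right| \le \frac{4}{\sqrt{2\pi}} \frac{p \sqrt{k}}{n}. \tag{5.1} \]

Moreover,

\[ \mathbb{E}\left[ \left( \widehat{R}_p(f_k) - L(f_k) \right)^2 \right] \le \frac{2\sqrt{2}}{\sqrt{\pi}} \frac{(2p + 3) \sqrt{k}}{n} + \frac{1}{n}. \tag{5.2} \]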


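Proof of Ineq. (5.1).

With $f_k^e$ denoting the kNN classifier built from a training sample $Z^e$ of cardinality $n - p$,
Lemma D.7 immediately provides

\begin{align*}
\left| \mathbb{E}\left[ \widehat{R}_p(f_k) - L(f_k) \right] \right| &\le \left| \mathbb{E}\left[ L(f_k^e) \right] - \mathbb{E}\left[ L(f_k) \right] \right| \\
&\le \mathbb{E}\left[ \left| \mathbb{1}_{\{ f_k^e(X) \neq Y \}} - \mathbb{1}_{\{ f_k(X) \neq Y \}} \right| \right] \\
&= \mathbb{P}\left( f_k^e(X) \neq f_k(X) \right) \le \frac{4}{\sqrt{2\pi}} \frac{p \sqrt{k}}{n}.
\end{align*}

□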


The proof of Ineq. (5.2) is more intricate and has been postponed to Appendix C.1.
First, Ineq. (5.1) can be interpreted as an upper bound on the bias between the classification
errors of the kNN classifier computed from respectively $n$ and $n - p$ observations.
Therefore, the fact that this upper bound increases with $p$ seems reasonable since the two
classification errors become more and more different from one another as $p$ grows. Note
also that Ineq. (5.1) combined with Jensen's inequality leads to a less accurate upper
bound than (5.2). Second, let us notice that we recover (up to constants) the original bound
derived for the L1O estimator with $p = 1$. However we do not know whether the precise
dependence on $p$ and $k$ is optimal or not. Finally Ineq. (5.1) entails that the LpO estimator
$\widehat{R}_p(f_k)$ of $L(f_k)$ is consistent as long as $p \sqrt{k} / n = o(1)$, which is in accordance with the
traditional consistency assumption on the kNN classification rule, that is $k/n \to 0$ as $n$
tends to $+\infty$ (see [19], Chap. 6.6 for instance).


Let us conclude this section with a corollary, which provides a finite-sample bound on
the gap between $\widehat{R}_p(f_k)$ and $R(f_k) = \mathbb{E}\left[ L(f_k) \right]$ with high probability. It relies on the
combination of the exponential concentration result we derived for $\widehat{R}_p(f_k)$ (Theorem 4.2)
with our upper bound on the bias (5.1).
Corollary 5.1. With the notation of Theorems 4.2 and 5.1, let us assume $p, k \ge 1$ with
$p + k \le n$, and $p \le n/2 + 1$. Then for every $x > 0$, there exists an event with probability
at least $1 - 2e^{-x}$ such that

\[ \left| R(f_k) - \widehat{R}_p(f_k) \right| \le \sqrt{ \frac{\Delta^2 k^2}{n \left( 1 - \frac{p - 1}{n} \right)} \, x } + \frac{4}{\sqrt{2\pi}} \frac{p \sqrt{k}}{n}. \tag{5.3} \]


Let us observe that if $k$ is kept fixed (independent of $n$), our upper bound shows that this gap
decreases to 0 provided $p/n \to 0$ as $n$ tends to $+\infty$. More
precisely, a convergence rate of $1/\sqrt{n}$ is achieved as long as $p = O(\sqrt{n})$.


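Proof of Corollary 5.1. Ineq. (5.3) results from combining Ineq. (4.2) (from Theorem 4.2)
and Ineq. (5.1):

\begin{align*}
\left| R(f_k) - \widehat{R}_p(f_k) \right| &\le \left| R(f_k) - \mathbb{E}\left[ \widehat{R}_p(f_k) \right] \right| + \left| \mathbb{E}\left[ \widehat{R}_p(f_k) \right] - \widehat{R}_p(f_k) \right| \\
&\le \frac{4}{\sqrt{2\pi}} \frac{p \sqrt{k}}{n} + \sqrt{ \frac{\Delta^2 k^2}{n - p + 1} \, x }.
\end{align*}

□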
A Proofs of polynomial moment upper bounds

A.1 Proof of Theorem 2.2

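According to the proof of Proposition 2.1, the LpO estimator can be expressed
as a $U$-statistic since

\[
\widehat{R}_p(g_n) = \frac{1}{n!} \sum_{\sigma} W\left(Z_{\sigma(1)}, \ldots, Z_{\sigma(n)}\right)
\]

with

\[
W(Z_1, \ldots, Z_n) = \left\lfloor \frac{n}{m} \right\rfloor^{-1} \sum_{a=1}^{\lfloor n/m \rfloor} h_m\left(Z_{(a-1)m+1}, \ldots, Z_{am}\right) \quad (\text{with } m = n - p + 1)
\]

and

\[
h_m(Z_1, \ldots, Z_m) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}_{\{ g(Z_{1,m}^{(i)}; X_i) \neq Y_i \}} ,
\]

where $g(Z_{1,m}^{(i)}; \cdot)$ denotes the classifier based on sample $Z_{1,m}^{(i)} = (Z_1, \ldots, Z_{i-1}, Z_{i+1}, \ldots, Z_m)$.

Further centering the LpO estimator, it comes

\[
\widehat{R}_p(g_n) - \mathbb{E}\big[\widehat{R}_p(g_n)\big] = \frac{1}{n!} \sum_{\sigma} \bar{W}\left(Z_{\sigma(1)}, \ldots, Z_{\sigma(n)}\right),
\]

where $\bar{W}(Z_1, \ldots, Z_n) = W(Z_1, \ldots, Z_n) - \mathbb{E}[W(Z_1, \ldots, Z_n)]$.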

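Then with $\bar{h}_m(Z_1, \ldots, Z_m) = h_m(Z_1, \ldots, Z_m) - \mathbb{E}[h_m(Z_1, \ldots, Z_m)]$, one gets

\[
\begin{aligned}
\mathbb{E}\left[ \left| \widehat{R}_p(g_n) - \mathbb{E}\big[\widehat{R}_p(g_n)\big] \right|^q \right]
&\leq \mathbb{E}\left[ \left| \bar{W}(Z_1, \ldots, Z_n) \right|^q \right] \quad \text{(Jensen's inequality)} \\
&= \mathbb{E}\left[ \left| \left\lfloor \frac{n}{m} \right\rfloor^{-1} \sum_{i=1}^{\lfloor n/m \rfloor} \bar{h}_m\left(Z_{(i-1)m+1}, \ldots, Z_{im}\right) \right|^q \right] \\
&= \left\lfloor \frac{n}{m} \right\rfloor^{-q} \mathbb{E}\left[ \left| \sum_{i=1}^{\lfloor n/m \rfloor} \bar{h}_m\left(Z_{(i-1)m+1}, \ldots, Z_{im}\right) \right|^q \right] .
\end{aligned} \tag{A.1}
\]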
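If $q = 2$, then by independence it comes

\[
\begin{aligned}
\mathbb{E}\left[ \left( \widehat{R}_p(g_n) - \mathbb{E}\big[\widehat{R}_p(g_n)\big] \right)^2 \right]
&\leq \left\lfloor \frac{n}{m} \right\rfloor^{-2} \operatorname{Var}\left( \sum_{i=1}^{\lfloor n/m \rfloor} h_m\left(Z_{(i-1)m+1}, \ldots, Z_{im}\right) \right) \\
&= \left\lfloor \frac{n}{m} \right\rfloor^{-2} \sum_{i=1}^{\lfloor n/m \rfloor} \operatorname{Var}\left[ h_m\left(Z_{(i-1)m+1}, \ldots, Z_{im}\right) \right] \\
&= \left\lfloor \frac{n}{m} \right\rfloor^{-1} \operatorname{Var}\left( \widehat{R}_1(g_{n-p+1}) \right),
\end{aligned}
\]

which leads to the result.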

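If $q > 2$ and $p \leq n/2 + 1$, then a straightforward use of Jensen's inequality from (A.1)
provides

\[
\mathbb{E}\left[ \left| \widehat{R}_p(g_n) - \mathbb{E}\big[\widehat{R}_p(g_n)\big] \right|^q \right]
\leq \left\lfloor \frac{n}{m} \right\rfloor^{-1} \sum_{i=1}^{\lfloor n/m \rfloor} \mathbb{E}\left[ \left| \bar{h}_m\left(Z_{(i-1)m+1}, \ldots, Z_{im}\right) \right|^q \right]
= \mathbb{E}\left[ \left| \widehat{R}_1(g_{n-p+1}) - \mathbb{E}\big[\widehat{R}_1(g_{n-p+1})\big] \right|^q \right] .
\]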


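If $q > 2$ and $p > n/2 + 1$, let us use Rosenthal's inequality (Proposition D.2) by
introducing symmetric random variables $\zeta_1, \ldots, \zeta_{\lfloor n/m \rfloor}$ such that

\[
\forall\, 1 \leq i \leq \lfloor n/m \rfloor, \quad \zeta_i = h_m\left(Z_{(i-1)m+1}, \ldots, Z_{im}\right) - h_m\left(Z'_{(i-1)m+1}, \ldots, Z'_{im}\right),
\]

where $Z'_1, \ldots, Z'_n$ are i.i.d. copies of $Z_1, \ldots, Z_n$. Then it comes for every $\gamma > 0$

\[
\mathbb{E}\left[ \left| \sum_{i=1}^{\lfloor n/m \rfloor} \bar{h}_m\left(Z_{(i-1)m+1}, \ldots, Z_{im}\right) \right|^q \right]
\leq \mathbb{E}\left[ \left| \sum_{i=1}^{\lfloor n/m \rfloor} \zeta_i \right|^q \right],
\]

which implies

\[
\mathbb{E}\left[ \left| \sum_{i=1}^{\lfloor n/m \rfloor} \bar{h}_m\left(Z_{(i-1)m+1}, \ldots, Z_{im}\right) \right|^q \right]
\leq \mathbb{E}\left[ \left| \sum_{i=1}^{\lfloor n/m \rfloor} \zeta_i \right|^q \right]
\leq B(q, \gamma) \left\{ \gamma \sum_{i=1}^{\lfloor n/m \rfloor} \mathbb{E}\left[ |\zeta_i|^q \right] \vee \left( \sqrt{ \sum_{i=1}^{\lfloor n/m \rfloor} \mathbb{E}\left[ \zeta_i^2 \right] } \right)^q \right\} .
\]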
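Then using for every $i$ that

\[
\mathbb{E}\left[ |\zeta_i|^q \right] \leq 2^q\, \mathbb{E}\left[ \left| \bar{h}_m\left(Z_{(i-1)m+1}, \ldots, Z_{im}\right) \right|^q \right],
\]

it comes

\[
\mathbb{E}\left[ \left| \sum_{i=1}^{\lfloor n/m \rfloor} \bar{h}_m\left(Z_{(i-1)m+1}, \ldots, Z_{im}\right) \right|^q \right]
\leq B(q, \gamma) \left\{ 2^q \gamma \left\lfloor \frac{n}{m} \right\rfloor \mathbb{E}\left[ \left| \widehat{R}_1(g_{n-p+1}) - \mathbb{E}\big[\widehat{R}_1(g_{n-p+1})\big] \right|^q \right] \vee \left( \sqrt{ \left\lfloor \frac{n}{m} \right\rfloor 2 \operatorname{Var}\left( \widehat{R}_1(g_{n-p+1}) \right) } \right)^q \right\} .
\]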

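Hence, it results for every $q > 2$

\[
\mathbb{E}\left[ \left| \widehat{R}_p(g_n) - \mathbb{E}\big[\widehat{R}_p(g_n)\big] \right|^q \right]
\leq B(q, \gamma) \left\{ 2^q \gamma \left\lfloor \frac{n}{m} \right\rfloor^{-q+1} \mathbb{E}\left[ \left| \widehat{R}_1(g_{n-p+1}) - \mathbb{E}\big[\widehat{R}_1(g_{n-p+1})\big] \right|^q \right] \vee \left\lfloor \frac{n}{m} \right\rfloor^{-q/2} \left( \sqrt{ 2 \operatorname{Var}\left( \widehat{R}_1(g_{n-p+1}) \right) } \right)^q \right\},
\]

which concludes the proof.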

A.2 Proof of Theorem 3.1 

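From Theorem 2.2 and Eq. (2.4), let us observe it is enough to upper bound $\mathbb{E}\left[ |\bar{h}_m(Z_1, \ldots, Z_m)|^q \right]$.
Then, Proposition 3.1 provides for every $q \geq 2$

\[
\left\| \bar{h}_m(Z_1, \ldots, Z_m) \right\|_q \leq \sqrt{2\kappa q} \sqrt{ \left\| \sum_{j=1}^{m} \left( h_m(Z_1, \ldots, Z_m) - h_m(Z_1, \ldots, Z'_j, \ldots, Z_m) \right)^2 \right\|_{q/2} } .
\]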


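The $j$-th term in the above sum is now upper bounded by

\[
\begin{aligned}
\left| h_m(Z_1, \ldots, Z_m) - h_m(Z_1, \ldots, Z'_j, \ldots, Z_m) \right|
&\leq \frac{1}{m} + \frac{1}{m} \sum_{i \neq j} \left| \mathbb{1}_{\{ f(Z^{(i)}; X_i) \neq Y_i \}} - \mathbb{1}_{\{ f(Z'^{j,(i)}; X_i) \neq Y_i \}} \right| \\
&\leq \frac{1}{m} + \frac{1}{m} \sum_{i \neq j} \mathbb{1}_{\{ f(Z^{(i)}; X_i) \neq f(Z'^{j,(i)}; X_i) \}} ,
\end{aligned} \tag{A.2}
\]

where $Z'^{j,(i)}$ denotes the sample $Z^{(i)}$ in which $Z_j$ is replaced with $Z'_j$, and with the notation of Theorem 2.1.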

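Furthermore, let us introduce for every $1 \leq j \leq m$,

\[
A_j = \{ 1 \leq i \leq m,\ i \neq j,\ j \in V_k(X_i) \} \quad \text{and} \quad A'_j = \{ 1 \leq i \leq m,\ i \neq j,\ j \in V'_k(X_i) \},
\]

where $V_k(X_i)$ and $V'_k(X_i)$ denote the indices of the $k$ nearest neighbors of $X_i$ respectively
among $X_1, \ldots, X_{j-1}, X_j, X_{j+1}, \ldots, X_m$ and $X_1, \ldots, X_{j-1}, X'_j, X_{j+1}, \ldots, X_m$. Setting $B_j = A_j \cup A'_j$,
one obtains

\[
\left| h_m(Z_1, \ldots, Z_m) - h_m(Z_1, \ldots, Z'_j, \ldots, Z_m) \right| \leq \frac{1}{m} + \frac{1}{m} \sum_{i \in B_j} \mathbb{1}_{\{ f(Z^{(i)}; X_i) \neq f(Z'^{j,(i)}; X_i) \}} . \tag{A.3}
\]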


From now on, we distinguish between q = 2 and q > 2 because we will be able to 
derive a tighter bound for q = 2 than for q > 2. 

A.2.1 Case q > 2 

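From (A.3), Stone's lemma (Lemma D.6) provides

\[
\left| h_m(Z_{1,m}) - h_m(Z'^{j}_{1,m}) \right| \leq \frac{1}{m} + \frac{1}{m} \sum_{i \in B_j} \mathbb{1}_{\{ f(Z^{(i)}; X_i) \neq f(Z'^{j,(i)}; X_i) \}} \leq \frac{1}{m} + \frac{2k\gamma_d}{m} ,
\]

where $Z'^{j}_{1,m} = (Z_1, \ldots, Z_{j-1}, Z'_j, Z_{j+1}, \ldots, Z_m)$.

Summing over $1 \leq j \leq m$ and applying $(a + b)^q \leq 2^{q-1}(a^q + b^q)$ ($a, b \geq 0$ and $q \geq 1$),
it comes

\[
\sum_{j=1}^{m} \left( h_m(Z_{1,m}) - h_m(Z'^{j}_{1,m}) \right)^2 \leq \frac{2}{m} \left( 1 + (2k\gamma_d)^2 \right) \leq \frac{4}{m} (2k\gamma_d)^2 ,
\]

hence

\[
\left\| \sum_{j=1}^{m} \left( h_m(Z_1, \ldots, Z_m) - h_m(Z_1, \ldots, Z'_j, \ldots, Z_m) \right)^2 \right\|_{q/2} \leq \frac{4}{m} (2k\gamma_d)^2 .
\]

This leads for every $q > 2$ to

\[
\left\| \bar{h}_m(Z_1, \ldots, Z_m) \right\|_q \leq q^{1/2} \sqrt{2\kappa}\, \frac{4k\gamma_d}{\sqrt{m}} ,
\]

which enables to conclude.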
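A.2.2 Case q = 2


It is possible to obtain a slightly better upper bound in the case $q = 2$ with the following
reasoning. With the same notation as above and from (A.3), one has

\[
\begin{aligned}
\mathbb{E}\left[ \left( h_m(Z_1, \ldots, Z_m) - h_m(Z_1, \ldots, Z'_j, \ldots, Z_m) \right)^2 \right]
&\leq \frac{2}{m^2} + \frac{2}{m^2}\, \mathbb{E}\left[ \left( \sum_{i \in B_j} \mathbb{1}_{\{ f(Z^{(i)}; X_i) \neq f(Z'^{j,(i)}; X_i) \}} \right)^2 \right] \\
&\leq \frac{2}{m^2} + \frac{2}{m^2}\, \mathbb{E}\left[ |B_j| \sum_{i \in B_j} \mathbb{1}_{\{ f(Z^{(i)}; X_i) \neq f(Z'^{j,(i)}; X_i) \}} \right]
\end{aligned}
\]

using Jensen's inequality. Lemma D.6 implies $|B_j| \leq 2k\gamma_d$, which allows to conclude

\[
\mathbb{E}\left[ \left( h_m(Z_1, \ldots, Z_m) - h_m(Z_1, \ldots, Z'_j, \ldots, Z_m) \right)^2 \right]
\leq \frac{2}{m^2} + \frac{4k\gamma_d}{m^2}\, \mathbb{E}\left[ \sum_{i \in B_j} \mathbb{1}_{\{ f(Z^{(i)}; X_i) \neq f(Z'^{j,(i)}; X_i) \}} \right] .
\]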


Summing over j, one derives 
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2 4A:7(i 

—- 1 - 
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El 

ieBj 


{/(Z(0;Xi)^/(Z'W;X7} 


2 4A:7(i 

=-^- -IK 

m m 


El 

ieBi 


{/(Z«;X7^/(Z'(0;X7} 


m 


< — + 

~ m m 

2=1 

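\[
\leq \frac{2}{m} + 4k\gamma_d \times 2\, \frac{4\sqrt{k}}{\sqrt{2\pi}\, m} = \frac{2}{m} + \frac{32\gamma_d}{\sqrt{2\pi}} \frac{k\sqrt{k}}{m} \leq (2 + 16\gamma_d) \frac{k\sqrt{k}}{m} , \tag{A.4}
\]

where $Z_0$ is an independent copy of $Z_1$, and the last but one inequality results from
Lemma D.7.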

A.3 Proof of Theorem 3.2 

Proofs of Ineq. (3.1), (3.2), and (3.3) straightforwardly result from the combination of
Theorem 3.1 and Ineq. (2.5) and (2.6) from Theorem 2.2.

The proof of Ineq. (3.4) results from the upper bounds settled in Theorem 3.1 and
plugged in Ineq. (2.7) (derived from Rosenthal's inequality with optimized constant $\gamma$,
namely Proposition D.3). Then it comes
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\j2Ci'/k 


V 


< 


n — p + 1 
with 

Ai = 2V^\l2CiVk 
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Finally introducing F = 2\/^max ^26*2, V^ClJ provides the result. 
B Proofs of exponential concentration inequalities


B.1 Proof of Proposition 4.1

Here the strategy is to use McDiarmid's inequality (Theorem D.3). We therefore start
by upper bounding the difference

\[
\left| \widehat{R}_p\left( \mathcal{A}_k(Z_{1,n}; \cdot) \right) - \widehat{R}_p\left( \mathcal{A}_k(Z'^{j}_{1,n}; \cdot) \right) \right|
\]

for every $1 \leq j \leq n$, where $Z'^{j}_{1,n} = (Z_1, \ldots, Z_{j-1}, Z'_j, Z_{j+1}, \ldots, Z_n)$.

Then using Eq. (2.3), one has 


Rp{Ak (Zpn; •)) - -Rp(A(Zi4; •)) 

1 " 1 

” i=l e 

1 " 1 

- « X] (p) ^{Ak{Z<^-,Xi)j^Ak{Z'’0.<^-,Xi)}^{i^e} 


Rp{Ak {Zi^n, •)) - Rp{Ak{Zi^n’ •)) for every 


P 


2=1 

n 


< 


„ X] (p) Y1 [^{f6E?(Xi)} + + (p) X] ^ 

e ^ e 




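where $Z^{e}$ (resp. $Z'^{j,e}$) denotes the set of random variables among $Z_{1,n}$ (resp. $Z'^{j}_{1,n}$) having indices in $e$, and $V_k^{e}(X_i)$
(resp. $V_k'^{e}(X_i)$) denotes the set of indices of the $k$ nearest neighbors of $X_i$ among $Z^{e}$
(resp. $Z'^{j,e}$).

Setting $\mathcal{E}_{n-p} = \mathcal{E}$, let us now introduce

\[
B_j^{e} = \left\{ 1 \leq i \leq n,\ i \notin e \cup \{j\},\ V_k^{e}(X_i) \ni j \ \text{or}\ V_k'^{e}(X_i) \ni j \right\} .
\]

Then Lemma D.6 implies $\mathrm{Card}(B_j^{e}) \leq 2(k + p - 1)\gamma_d$, hence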


Rpifk) - Rpifk) 


^ (;E©^'E2'Ui 


^ ieBf 


0e} 


I ^ 4{k+p-l)jd ^ 1 
n ~ n n 


Applying McDiarmid’s inequality (Section D.1.6) then completes the proof. 


B.2 Proof of Theorem 4.1

The first step of the proof consists in using Ineq. (D.5) (generalized Efron-Stein inequality)
to upper bound the $2q$-th moments of

\[
\widehat{R}_p(\widehat{f}_k) = \binom{n}{p}^{-1} \sum_{e \in \mathcal{E}_{n-p}} \frac{1}{p} \sum_{i \notin e} \mathbb{1}_{\{ \widehat{f}_k(Z^{e}; X_i) \neq Y_i \}} .
\]

With the same notation as in the proof of Proposition 4.1, one gets for every $1 \leq j \leq n$

Rp{Ak{Ziy, •)) - Rp{Ak{Z'{^n^ •)) 


(:)■ 




{h{Z<^-,Xi)^Yi} ^{fk{Z',<^,j-,Xi)^Y, 


Absolute values and Jensen’s inequality then provide 

%{Ak{Zi^n'-, •)) - %{Ak{Z'{^n\ O)! 
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-1 
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n \p 


- + -Ei 

n p 
^ 2 = 1 






2Ee 


j£e,i£e, fk{Z^;Xi) ^ fk{Z’’^’^-,Xi) 


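where $\mathbb{P}_e$ denotes the discrete uniform probability over the set $\mathcal{E}_{n-p}$ of all $n - p$ distinct
indices among $\{1, \ldots, n\}$.

Let us further notice that $\left\{ \widehat{f}_k(Z^{e}; X_i) \neq \widehat{f}_k(Z'^{j,e}; X_i) \right\} \subset \left\{ j \in V_k^{e}(X_i) \cup V_k'^{e}(X_i) \right\}$,
where $V_k'^{e}(X_i)$ denotes the set of indices of the $k$ nearest neighbors of $X_i$ among $Z'^{j,e}$
with the notation of the proof of Proposition 4.1. Then it results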


^Pe [j G e, i G e, fk{Z^;Xi) / A(Z'’'^’A 


2=1 


< 


^Pe [j G e, i G e, J G V,%Xi) U 


2=1 


n 

< E [■?' G e, i G A j G Vi{X,) ] + p, [ j E e, i G e, j G If (A^) U P,'’'’"(A0 ] ) 
2=1 

n 

< 2^Pe [j G e, i G e, j G If (W)] , 

2=1 

which leads to 

^ ^ 12^ 

Rp{Ak{Zi,n. ■)) - Rp{Ak{Z'{i-, •)) < - + - J]Pe [j G e, i G e, j G If (W) ]. 

n p 
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Summing over 1 < j < n the square of the above quantity, it results 


IL 2 

^ •)) - RpiMz^-, •))) 


ft' 1 9 

^ i=l 


i=i 


1 


< 2 ^ “ X] £ e, i G e, j G V^{Xi) ] 


t=i 


p 


2=1 


^ “ + 8 X] I “ X] ^ ^ ^ ^ ^kiZ^i) ] I • 

j=l i=l ) 


Further using that 

j=l \p i=l J 

n ^ n 

= X] ~ X/ € e, i G e, j G V^{Xi) ])^ + 

i=i^ i=i 
n 

E - E IPe [j G e, i G e, j G [j € e, i € e, j € V,%X,) ] 

j=l ^ l<i^£<n 

= ri + T2 , 

let us now successively deal with each of these two terms. 


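Upper bound on T1. First, we start by partitioning the sum over $j$ depending on the
rank of $X_j$ as a neighbor of $X_i$ in the whole sample $(X_1, \ldots, X_n)$. It comes

\[
\begin{aligned}
T_1 &= \frac{1}{p^2} \sum_{j=1}^{n} \sum_{i=1}^{n} \left\{ \mathbb{P}_e\left[ j \in e,\ i \notin e,\ j \in V_k^{e}(X_i) \right] \right\}^2 \\
&= \frac{1}{p^2} \sum_{i=1}^{n} \left( \sum_{j \in V_k(X_i)} \left\{ \mathbb{P}_e\left[ j \in e,\ i \notin e,\ j \in V_k^{e}(X_i) \right] \right\}^2 + \sum_{j \in V_{k+p}(X_i) \setminus V_k(X_i)} \left\{ \mathbb{P}_e\left[ j \in e,\ i \notin e,\ j \in V_k^{e}(X_i) \right] \right\}^2 \right) .
\end{aligned}
\]

Then Lemma D.5 leads to

\[
\begin{aligned}
& \sum_{j \in V_k(X_i)} \left\{ \mathbb{P}_e\left[ j \in e,\ i \notin e,\ j \in V_k^{e}(X_i) \right] \right\}^2 + \sum_{j \in V_{k+p}(X_i) \setminus V_k(X_i)} \left\{ \mathbb{P}_e\left[ j \in e,\ i \notin e,\ j \in V_k^{e}(X_i) \right] \right\}^2 \\
&\quad \leq \sum_{j \in V_k(X_i)} \left( \frac{p}{n} \frac{n-p}{n-1} \right)^2 + \frac{p}{n} \frac{n-p}{n-1} \sum_{j \in V_{k+p}(X_i) \setminus V_k(X_i)} \mathbb{P}_e\left[ j \in e,\ i \notin e,\ j \in V_k^{e}(X_i) \right] \\
&\quad = k \left( \frac{p}{n} \frac{n-p}{n-1} \right)^2 + \frac{kp}{n} \frac{p-1}{n-1} \cdot \frac{p}{n} \frac{n-p}{n-1} = k \left( \frac{p}{n} \right)^2 \frac{n-p}{n-1} ,
\end{aligned}
\]

where the upper bound results from $\sum_j a_j^2 \leq (\max_j a_j) \sum_j a_j$, for $a_j \geq 0$. It results

\[
T_1 = \frac{1}{p^2} \sum_{j=1}^{n} \sum_{i=1}^{n} \left\{ \mathbb{P}_e\left[ j \in e,\ i \notin e,\ j \in V_k^{e}(X_i) \right] \right\}^2
\leq \frac{1}{p^2}\, n\, k \left( \frac{p}{n} \right)^2 \frac{n-p}{n-1} = \frac{k}{n} \frac{n-p}{n-1} .
\]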
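The algebra behind the bound on T1 can be double-checked with exact rational arithmetic. The sketch below is only a sanity check under the reading of Lemma D.5 used above (probability $\frac{p}{n}\frac{n-p}{n-1}$ for each of the $k$ nearest neighbors, and total mass $\frac{kp}{n}\frac{p-1}{n-1}$ for the remaining ranks); the function name `t1_bound` is hypothetical.

```python
from fractions import Fraction


def t1_bound(n, p, k):
    """Return (bound for a fixed i, resulting bound on T1) as exact fractions."""
    # Probability attached to each j among the k nearest neighbors of X_i.
    a = Fraction(p, n) * Fraction(n - p, n - 1)
    # Total probability mass carried by the ranks k < sigma_i(j) <= k + p, see (D.8).
    tail_mass = k * Fraction(p, n) * Fraction(p - 1, n - 1)
    # sum_j a_j^2 <= (max_j a_j) * sum_j a_j is applied to the tail ranks.
    per_i = k * a**2 + a * tail_mass
    # Summing over the n values of i and dividing by p^2.
    return per_i, n * per_i / p**2
```

For instance, `t1_bound(10, 3, 2)` returns `(Fraction(7, 50), Fraction(7, 45))`, which matches $k(p/n)^2\frac{n-p}{n-1}$ and $\frac{k}{n}\frac{n-p}{n-1}$.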


Upper bound on T2 Let us now apply the same idea to the second sum, partitioning 
the sum over j depending on the rank of j as a neighbor of i in the whole sample. Then, 


1 ” 

= E ^e[j€e, i€e, jeVaXi)]¥e[j€e, iee, j€ViiXe)] 

P j=l l<i^£<n 

1 ^ 

E Pe[jGe, iGe, 


i=i £^i ieVfc(x«) 


1 

+ “2 E E E Pe [j G e, i G e, j G V^{Xi) ] 


^ i=l 3&Vk+p{Xi)\Vk{Xi 

We then apply Stone’s lemma (Lemma D.6) to get 
T2 


kp p — 1 
n n — 1 


^ n n I 

^EE^® G e, i G e, j G 14®(X*)] I 
P i=i j=i \e^i 


p n — p 


+ E ^3&Vk+p{x,)\vuxe 




kp p — 1 
n n — 1 


l-Afcp/ pn-p kpp-l\ k^fn-p 

< ^ V — k'jd - - + [k + phd -7 = 7rf— -7 + {k+p) 

pz n \ nn — 1 
1=1 


n n — 1 


n V— 1 


n — 1 


= Id— (l + {k + p- 1 )-—^ 
n \ n — 1 
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Gathering the upper bounds The two previous bounds provide 

' kn-p e 


n C ^ n 

E =T1 + T2<-^ 

p nn — 1 

7=1 i=l ) 


n 


+ id— [I + [k+p - ly 


n 


which enables to conclude 

E;=i (i?p(A(^i,n;-)) - 

< I {l + Ak + ^e^d [l + ik+p)^]) < Jl + (A;+p)^ 

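Then (D.5) provides for every $q \geq 1$

\[
\left\| \widehat{R}_p(\widehat{f}_k) - \mathbb{E}\big[\widehat{R}_p(\widehat{f}_k)\big] \right\|_{2q}
\leq 4\sqrt{\kappa q} \sqrt{ \frac{8(1 + \gamma_d)k^2}{n} \left[ 1 + (k + p) \frac{p-1}{n-1} \right] } .
\]

Hence combined with $q! \geq q^q e^{-q} \sqrt{2\pi q}$, it comes

\[
\mathbb{E}\left[ \left( \widehat{R}_p(\widehat{f}_k) - \mathbb{E}\big[\widehat{R}_p(\widehat{f}_k)\big] \right)^{2q} \right]
\leq (16\kappa q)^q \left( \frac{8(1 + \gamma_d)k^2}{n} \left[ 1 + (k + p) \frac{p-1}{n-1} \right] \right)^q
\leq q! \left( \frac{128 \kappa e (1 + \gamma_d) k^2}{n} \left[ 1 + (k + p) \frac{p-1}{n-1} \right] \right)^q .
\]

The conclusion follows from Lemma D.3 with $C = \frac{128 \kappa e (1 + \gamma_d) k^2}{n} \left[ 1 + (k + p) \frac{p-1}{n-1} \right]$. Then
for every $t > 0$,

\[
\mathbb{P}\left( \widehat{R}_p(\widehat{f}_k) - \mathbb{E}\big[\widehat{R}_p(\widehat{f}_k)\big] > t \right) \vee \mathbb{P}\left( \mathbb{E}\big[\widehat{R}_p(\widehat{f}_k)\big] - \widehat{R}_p(\widehat{f}_k) > t \right)
\leq \exp\left( - \frac{n t^2}{1024\, e \kappa\, k^2 (1 + \gamma_d) \left[ 1 + (k + p) \frac{p-1}{n-1} \right]} \right) .
\]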


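B.3 Proof of Theorem 4.2 and Proposition 4.2

Proof of Theorem 4.2.

If $p \leq n/2 + 1$:

From (3.1) and (3.2) applied with $2q$, and further introducing a constant $\Delta = 4\sqrt{e} \max\left( \sqrt{C_1/2},\, C_2 \right)$,
it comes for every $q \geq 1$

\[
\mathbb{E}\left[ \left( \widehat{R}_p(\widehat{f}_k) - \mathbb{E}\big[\widehat{R}_p(\widehat{f}_k)\big] \right)^{2q} \right]
\leq \left( \frac{\Delta^2}{16 e} \frac{k^2}{n - p + 1} \right)^q (2q)^q
\leq \left( \frac{\Delta^2}{8} \frac{k^2}{n - p + 1} \right)^q q! , \tag{B.1}
\]

with $q^q \leq q!\, e^q / \sqrt{2\pi q}$. Then Lemma D.3 provides for every $t > 0$

\[
\mathbb{P}\left( \widehat{R}_p(\widehat{f}_k) - \mathbb{E}\big[\widehat{R}_p(\widehat{f}_k)\big] > t \right) \vee \mathbb{P}\left( \mathbb{E}\big[\widehat{R}_p(\widehat{f}_k)\big] - \widehat{R}_p(\widehat{f}_k) > t \right)
\leq \exp\left( - (n - p + 1) \frac{t^2}{\Delta^2 k^2} \right) .
\]

If $p > n/2 + 1$:

Let us now use (3.1) and (3.4) combined with (D.1), where $C = \left\lfloor \frac{n}{n-p+1} \right\rfloor$, $q_0 = 2$, and
$\min_j \alpha_j = 1/2$. This provides for every $t > 0$

\[
\mathbb{P}\left( \left| \widehat{R}_p(\widehat{f}_k) - \mathbb{E}\big[\widehat{R}_p(\widehat{f}_k)\big] \right| > t \right)
\leq \left\lfloor \frac{n}{n-p+1} \right\rfloor e \times \exp\left[ - \frac{1}{2e} \min\left\{ (n - p + 1) \left\lfloor \frac{n}{n-p+1} \right\rfloor \frac{t^2}{4\Gamma^2 k\sqrt{k}} ,\ \left( (n - p + 1) \left\lfloor \frac{n}{n-p+1} \right\rfloor^2 \frac{t^2}{4\Gamma^2 k^2} \right)^{1/3} \right\} \right] ,
\]

where $\Gamma$ arises in Eq. (3.4).

□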
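The second inequality in (B.1) only uses the Stirling-type bound $q^q \leq e^q q!$. A minimal numerical sketch of this step, written on the log scale to avoid overflow, is given below; the function names are illustrative and not part of the original proof.

```python
import math


def log_lhs(q):
    """log of (2q)^q / (16 e)^q, the q-dependent factor before applying Stirling."""
    return q * (math.log(2 * q) - math.log(16 * math.e))


def log_rhs(q):
    """log of q! / 8^q, the q-dependent factor after applying Stirling."""
    return math.lgamma(q + 1) - q * math.log(8)


def stirling_step_holds(q_max):
    """Check (2q)^q / (16e)^q <= q! / 8^q for q = 1, ..., q_max."""
    return all(log_lhs(q) <= log_rhs(q) for q in range(1, q_max + 1))
```

The common factor $(\Delta^2 k^2/(n-p+1))^q$ cancels on both sides, so the check does not depend on $\Delta$, $k$, $n$ or $p$.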


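Proof of Proposition 4.2. With the same notation and reasoning as in the previous proof,
let us combine (3.1) and (3.4). From (D.2) of Proposition D.1 where $C = \left\lfloor \frac{n}{n-p+1} \right\rfloor$, $q_0 = 2$,
and $\min_j \alpha_j = 1/2$, it results for every $t > 0$

\[
\mathbb{P}\left[ \left| \widehat{R}_p(\widehat{f}_k) - \mathbb{E}\big[\widehat{R}_p(\widehat{f}_k)\big] \right| > \sqrt{2e}\, \Gamma \sqrt{ \frac{k\sqrt{k}}{(n - p + 1) \left\lfloor \frac{n}{n-p+1} \right\rfloor} }\; t^{1/2} + (2e)^{3/2}\, \Gamma\, \frac{k}{\left\lfloor \frac{n}{n-p+1} \right\rfloor \sqrt{n - p + 1}}\; t^{3/2} \right]
\leq \left\lfloor \frac{n}{n-p+1} \right\rfloor e \cdot e^{-t} ,
\]

where $\Gamma > 0$ is given by Eq. (3.4).

□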
C Proofs of deviation upper bounds

C.1 Proof of Ineq. (5.2) in Theorem 5.1

The proof follows the same strategy as that of Theorem 2.1 in [38].

Along the proof, we will repeatedly use some notation that we briefly introduce here.
First, let us introduce $Z_0 = (X_0, Y_0)$ and $Z_{n+1} = (X_{n+1}, Y_{n+1})$ that are independent
copies of $Z_1$. Second, to ease the reading of the proof, we also use several shortcuts:
$\widehat{f}_k(X_0) = \widehat{f}_k(Z_{1,n}; X_0)$, and $\widehat{f}_k(e, X_0) = \widehat{f}_k(Z^{e}_{1,n}; X_0)$ for every set of indices $e \in \mathcal{E}_{n-p}$
(with cardinality $n - p$). Finally along the proof, $e, e' \in \mathcal{E}_{n-p}$ denote random sets of
distinct indices with discrete uniform distribution over $\mathcal{E}_{n-p}$. Therefore the notation $\mathbb{P}_e$
(resp. $\mathbb{P}_{e,e'}$) is used to emphasize that integration is made with respect to $e$ (resp. to $e, e'$).


C.1.1 Main part of the proof 

Starting from 


E 


{RpUk) - L{h)f] = E \Rlifk)\ + E [Ll] - 2E [i?p(A)L(A) 


let us notice that 


E [Ll] = P (a(Xo) 7 ^ To, fk{Xn+l) / Tn+l) , 


and 


E 


RpifkWk)] = P (/fc(Xo) / To, fkie, Xi) e) Pg (i 0 e) 


It immediately comes 


E 


{Rpifk) - L{h)f 


= lE 


+ 


Kifk) - 

{P (/fe(Xo) / 


P (/fe(Xo) / To, fk{e,Xi) ^Yi\ i ^ e) Pg (i ^ e)} 

To, fk{Xn+i) / Tn+i) - P [fkiXo) / To, fk{e, X^) / Y, 


(C.l) 

^ Pg (i ^ e)| . 
(C.2) 


The proof then consists in successively upper bounding the two terms (C.l) and (C.2) of 
the last equality. 


Upper bound of (C.l) First, we have 


p'E Riifk) 


Ee[ 






E^[: 

i¥=j 




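Let us now introduce the following events.

\[
\begin{aligned}
A_{e,e',i} &= \{ i \notin e,\ i \notin e' \}, \\
A^{1}_{e,e',i,j} &= \{ i \notin e,\ j \notin e',\ i \notin e',\ j \notin e \}, \\
A^{2}_{e,e',i,j} &= \{ i \notin e,\ j \notin e',\ i \notin e',\ j \in e \}, \\
A^{3}_{e,e',i,j} &= \{ i \notin e,\ j \notin e',\ i \in e',\ j \notin e \}, \\
A^{4}_{e,e',i,j} &= \{ i \notin e,\ j \notin e',\ i \in e',\ j \in e \} .
\end{aligned}
\]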


Then, 


p'E m/k) 


(/fc(e,X0 / Y„ 7k{e',Xi) ^ Pe,e' (^e,e',0 


+ 


j;^p(A(e,X,) /Ti, /fe(e',X,) (<e'« 


i^j £=i 


= nP (/fc(e,Xi) / Ti, /fc(e',Xi) + YMe,e',l) Pe,e' {Ae,e',l) 

4 

+n(n - 1) J^P (A(e,Xi) / Ti, fk{e',X 2 ) / ^2 | <eM, 2 ) Pe,e' (<e', 


Furthermore since 




4 1 ^ 

nPe,e' (^e,e',l) + n(n “ 1) ^ Pe,e' (^le', 1 , 2 ) = ” X] ^ J ^ e') = 1, 

^=1 J ^ h.7 


it comes 


E 


fiJ(A)] -p(A(Vo) Fo.A(e.Xi) # F, 


= 4.4 + h^B, (C,3) 

pZ pZ 
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where 


A = 


(fk(e,Xi) ^ Yi, Me',Xi) / ^ | 

-P (/fc(Xo) / YoJk{e,X,) / n I ^e,eM)] Pe,e' {Ae,e',l) , 

4 

and B = Y^ [P (/fc(e, Xi) / n, /^(e', X2) / ^2 I <eM,2) 
e=i 

-P (/fc(Xo) / YoJkie,X,) / n I <eM,2)] Pe,e' (<eM,2) • 


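• Upper bound for A:

To upper bound $A$, simply notice that:

\[
A \leq \mathbb{P}_{e,e'}\left( A_{e,e',1} \right) \leq \mathbb{P}_{e,e'}\left( i \notin e,\ i \notin e' \right) \leq \left( \frac{p}{n} \right)^2 .
\]

• Upper bound for B:

To obtain an upper bound for $B$, one needs to upper bound

\[
\mathbb{P}\left( \widehat{f}_k(e, X_1) \neq Y_1,\ \widehat{f}_k(e', X_2) \neq Y_2 \mid A^{\ell}_{e,e',1,2} \right) - \mathbb{P}\left( \widehat{f}_k(X_0) \neq Y_0,\ \widehat{f}_k(e, X_1) \neq Y_1 \mid A^{\ell}_{e,e',1,2} \right) , \tag{C.4}
\]

which depends on $\ell$, i.e. on the fact that index 2 belongs or not to the training set indices
$e$.

• If $2 \notin e$ (i.e. $\ell = 1$ or 3): Then, Lemma C.2 proves

\[
\text{(C.4)} \leq \frac{4p\sqrt{k}}{\sqrt{2\pi}\, n} .
\]

• If $2 \in e$ (i.e. $\ell = 2$ or 4): Then, Lemma C.3 settles

\[
\text{(C.4)} \leq \frac{8\sqrt{k}}{\sqrt{2\pi}\,(n - p)} + \frac{4p\sqrt{k}}{\sqrt{2\pi}\, n} .
\]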
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Combining the previous bounds and Lemma C.l leads to 


B < 


/ Apy/k 

y \/^n 

+ 


e,e' (^e,e',l,2) + ^e,e' (^le', 1 , 2 ) ] 






< 


< 


< 


+ 

2y/2 


2 p 

+ 


n — p n 


"K 

2y/2 


Vk 


Vk 


Pi 


' ("4e,e',l,2) + ^e,e' {^t,e',l,2) ] 
2 


'TT 

2y/2 


n 

P fP\^ 


{i ^ e, j ^ e') H-e 

n — p 


,e' (^e,e',l,2) +^^6,6' (^te', 1 , 2 )) 


+ 


(n — p)p^{p — 1) ^ (n — pYp^ 


n \nJ n—p\ n^(n —1)^ n^(n — 1)^ 


n/ 


p 2 

- + 


n n — 1 


Back to Eq. (C.3), one deduces 


E 


R: 


;(A)1 - F (a(v„) ^ r„, A(e.xo ^ r.) = + !^b < i + . 

J \ / p^ p^ n yTT n 


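Upper bound of (C.2). First observe that

\[
\mathbb{P}\left( \widehat{f}_k(X_0) \neq Y_0,\ \widehat{f}_k(e, X_i) \neq Y_i \mid i \notin e \right) = \mathbb{P}\left( \widehat{f}_k^{(-1)}(X_0) \neq Y_0,\ \widehat{f}_k(e, X_{n+1}) \neq Y_{n+1} \right) ,
\]

where $\widehat{f}_k^{(-1)}$ is built on sample $(X_2, Y_2), \ldots, (X_{n+1}, Y_{n+1})$. One has

\[
\begin{aligned}
& \mathbb{P}\left( \widehat{f}_k(X_0) \neq Y_0,\ \widehat{f}_k(X_{n+1}) \neq Y_{n+1} \right) - \mathbb{P}\left( \widehat{f}_k(X_0) \neq Y_0,\ \widehat{f}_k(e, X_i) \neq Y_i \mid i \notin e \right) \\
&\quad = \mathbb{P}\left( \widehat{f}_k(X_0) \neq Y_0,\ \widehat{f}_k(X_{n+1}) \neq Y_{n+1} \right) - \mathbb{P}\left( \widehat{f}_k^{(-1)}(X_0) \neq Y_0,\ \widehat{f}_k(e, X_{n+1}) \neq Y_{n+1} \right) \\
&\quad \leq \mathbb{P}\left( \widehat{f}_k(X_0) \neq \widehat{f}_k^{(-1)}(X_0) \right) + \mathbb{P}\left( \widehat{f}_k(e, X_{n+1}) \neq \widehat{f}_k(X_{n+1}) \right) \\
&\quad \leq \frac{4\sqrt{k}}{\sqrt{2\pi}\, n} + \frac{4p\sqrt{k}}{\sqrt{2\pi}\, n} ,
\end{aligned}
\]

where we used Lemma D.7 again to obtain the last inequality.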
Conclusion:


The conclusion simply results from combining the bounds on (C.1) and (C.2), which leads to

\[
\mathbb{E}\left[ \left( \widehat{R}_p(\widehat{f}_k) - L(\widehat{f}_k) \right)^2 \right] \leq \frac{2\sqrt{2}}{\sqrt{\pi}} \frac{(2p + 3)\sqrt{k}}{n} + \frac{1}{n} .
\]

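C.1.2 Combinatorial lemmas


All the lemmas of the present section are settled with the same notation as in the proof
of Theorem 5.1 (see Section C.1.1).


Lemma C.1.

\[
\begin{aligned}
\mathbb{P}_{e,e'}\left( A^{1}_{e,e',1,2} \right) &= \frac{\binom{n-2}{n-p}}{\binom{n}{n-p}} \cdot \frac{\binom{n-2}{n-p}}{\binom{n}{n-p}} , &
\mathbb{P}_{e,e'}\left( A^{2}_{e,e',1,2} \right) &= \frac{\binom{n-2}{n-p-1}}{\binom{n}{n-p}} \cdot \frac{\binom{n-2}{n-p}}{\binom{n}{n-p}} , \\
\mathbb{P}_{e,e'}\left( A^{3}_{e,e',1,2} \right) &= \frac{\binom{n-2}{n-p}}{\binom{n}{n-p}} \cdot \frac{\binom{n-2}{n-p-1}}{\binom{n}{n-p}} , &
\mathbb{P}_{e,e'}\left( A^{4}_{e,e',1,2} \right) &= \frac{\binom{n-2}{n-p-1}}{\binom{n}{n-p}} \cdot \frac{\binom{n-2}{n-p-1}}{\binom{n}{n-p}} .
\end{aligned}
\]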
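Proof of Lemma C.1.

\[
\begin{aligned}
\mathbb{P}_{e,e'}\left( A^{1}_{e,e',i,j} \right) &= \mathbb{P}_{e,e'}\left( i \notin e,\ j \notin e',\ i \notin e',\ j \notin e \right)
= \mathbb{P}_{e}\left( i \notin e,\ j \notin e \right) \mathbb{P}_{e'}\left( j \notin e',\ i \notin e' \right)
= \frac{\binom{n-2}{n-p}}{\binom{n}{n-p}} \cdot \frac{\binom{n-2}{n-p}}{\binom{n}{n-p}} , \\
\mathbb{P}_{e,e'}\left( A^{2}_{e,e',i,j} \right) &= \mathbb{P}_{e,e'}\left( i \notin e,\ j \notin e',\ i \notin e',\ j \in e \right)
= \mathbb{P}_{e}\left( i \notin e,\ j \in e \right) \mathbb{P}_{e'}\left( j \notin e',\ i \notin e' \right)
= \frac{\binom{n-2}{n-p-1}}{\binom{n}{n-p}} \cdot \frac{\binom{n-2}{n-p}}{\binom{n}{n-p}} , \\
\mathbb{P}_{e,e'}\left( A^{3}_{e,e',i,j} \right) &= \mathbb{P}_{e,e'}\left( i \notin e,\ j \notin e',\ i \in e',\ j \notin e \right)
= \mathbb{P}_{e}\left( i \notin e,\ j \notin e \right) \mathbb{P}_{e'}\left( j \notin e',\ i \in e' \right)
= \frac{\binom{n-2}{n-p}}{\binom{n}{n-p}} \cdot \frac{\binom{n-2}{n-p-1}}{\binom{n}{n-p}} , \\
\mathbb{P}_{e,e'}\left( A^{4}_{e,e',i,j} \right) &= \mathbb{P}_{e,e'}\left( i \notin e,\ j \notin e',\ i \in e',\ j \in e \right)
= \mathbb{P}_{e}\left( i \notin e,\ j \in e \right) \mathbb{P}_{e'}\left( j \notin e',\ i \in e' \right)
= \frac{\binom{n-2}{n-p-1}}{\binom{n}{n-p}} \cdot \frac{\binom{n-2}{n-p-1}}{\binom{n}{n-p}} .
\end{aligned}
\]

□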
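The four probabilities of Lemma C.1 can be checked by brute-force enumeration on a small example. The sketch below assumes, as the product form in the proof suggests, that $e$ and $e'$ are drawn independently and uniformly among subsets of size $n - p$; the function names are hypothetical and indices are 0-based.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def enumerated_probabilities(n, p, i=0, j=1):
    """Exact probabilities of the four events A^1..A^4 by listing all pairs (e, e')."""
    train_sets = [frozenset(c) for c in combinations(range(n), n - p)]
    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    for e in train_sets:
        if i in e:
            continue  # every event requires i not in e
        for e2 in train_sets:
            if j in e2:
                continue  # every event requires j not in e'
            if i not in e2:
                counts[1 if j not in e else 2] += 1
            else:
                counts[3 if j not in e else 4] += 1
    total = len(train_sets) ** 2
    return {ell: Fraction(c, total) for ell, c in counts.items()}


def lemma_c1_probabilities(n, p):
    """Closed-form values stated in Lemma C.1."""
    both_out = Fraction(comb(n - 2, n - p), comb(n, n - p))      # two indices outside
    one_in = Fraction(comb(n - 2, n - p - 1), comb(n, n - p))    # one outside, one inside
    return {1: both_out**2, 2: one_in * both_out, 3: both_out * one_in, 4: one_in**2}
```

The four values sum to $(p/n)^2$, the probability that $i \notin e$ and $j \notin e'$, which is the identity used just before Eq. (C.3).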


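Lemma C.2. With the above notation, for $\ell \in \{1, 3\}$, it comes

\[
\mathbb{P}\left( \widehat{f}_k(e, X_1) \neq Y_1,\ \widehat{f}_k(e', X_2) \neq Y_2 \mid A^{\ell}_{e,e',1,2} \right) - \mathbb{P}\left( \widehat{f}_k(X_0) \neq Y_0,\ \widehat{f}_k(e, X_1) \neq Y_1 \mid A^{\ell}_{e,e',1,2} \right) \leq \frac{4p\sqrt{k}}{\sqrt{2\pi}\, n} .
\]

Proof of Lemma C.2. First recall that $Z_0$ is a test sample, i.e. $Z_0$ cannot belong to either
$e$ or $e'$. Consequently, an exhaustive formulation of

\[
\mathbb{P}\left( \widehat{f}_k(X_0) \neq Y_0,\ \widehat{f}_k(e, X_1) \neq Y_1 \mid A^{\ell}_{e,e',1,2} \right)
\]

is

\[
\mathbb{P}\left( \widehat{f}_k(X_0) \neq Y_0,\ \widehat{f}_k(e, X_1) \neq Y_1 \mid A^{\ell}_{e,e',1,2},\ 0 \notin e,\ 0 \notin e' \right) .
\]

Then one has

\[
\mathbb{P}\left( \widehat{f}_k(X_0) \neq Y_0,\ \widehat{f}_k(e, X_1) \neq Y_1 \mid A^{\ell}_{e,e',1,2} \right)
= \mathbb{P}\left( \widehat{f}_k^{(-2)}(X_2) \neq Y_2,\ \widehat{f}_k(e, X_1) \neq Y_1 \mid A^{\ell}_{e,e',1,2},\ 0 \notin e,\ 0 \notin e' \right) ,
\]

where $\widehat{f}_k^{(-2)}$ is built on sample $(X_0, Y_0), (X_1, Y_1), (X_3, Y_3), \ldots, (X_n, Y_n)$. Hence

\[
\begin{aligned}
& \mathbb{P}\left( \widehat{f}_k(e, X_1) \neq Y_1,\ \widehat{f}_k(e', X_2) \neq Y_2 \mid A^{\ell}_{e,e',1,2} \right) - \mathbb{P}\left( \widehat{f}_k(X_0) \neq Y_0,\ \widehat{f}_k(e, X_1) \neq Y_1 \mid A^{\ell}_{e,e',1,2} \right) \\
&\quad = \mathbb{P}\left( \widehat{f}_k(e, X_1) \neq Y_1,\ \widehat{f}_k(e', X_2) \neq Y_2 \mid A^{\ell}_{e,e',1,2},\ 0 \notin e,\ 0 \notin e' \right) \\
&\qquad - \mathbb{P}\left( \widehat{f}_k^{(-2)}(X_2) \neq Y_2,\ \widehat{f}_k(e, X_1) \neq Y_1 \mid A^{\ell}_{e,e',1,2},\ 0 \notin e,\ 0 \notin e' \right) \\
&\quad \leq \mathbb{P}\left( \{ \widehat{f}_k(e, X_1) \neq Y_1 \} \,\triangle\, \{ \widehat{f}_k(e, X_1) \neq Y_1 \} \mid A^{\ell}_{e,e',1,2},\ 0 \notin e,\ 0 \notin e' \right) \\
&\qquad + \mathbb{P}\left( \{ \widehat{f}_k^{(-2)}(X_2) \neq Y_2 \} \,\triangle\, \{ \widehat{f}_k(e', X_2) \neq Y_2 \} \mid A^{\ell}_{e,e',1,2},\ 0 \notin e,\ 0 \notin e' \right) \\
&\quad = \mathbb{P}\left( \widehat{f}_k^{(-2)}(X_2) \neq \widehat{f}_k(e', X_2) \mid A^{\ell}_{e,e',1,2},\ 0 \notin e,\ 0 \notin e' \right) \\
&\quad \leq \frac{4p\sqrt{k}}{\sqrt{2\pi}\, n} ,
\end{aligned}
\]

by Lemma D.7.

□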


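Lemma C.3. With the above notation, for $\ell \in \{2, 4\}$, it comes

\[
\mathbb{P}\left( \widehat{f}_k(e, X_1) \neq Y_1,\ \widehat{f}_k(e', X_2) \neq Y_2 \mid A^{\ell}_{e,e',1,2} \right) - \mathbb{P}\left( \widehat{f}_k(X_0) \neq Y_0,\ \widehat{f}_k(e, X_1) \neq Y_1 \mid A^{\ell}_{e,e',1,2} \right)
\leq \frac{8\sqrt{k}}{\sqrt{2\pi}\,(n - p)} + \frac{4p\sqrt{k}}{\sqrt{2\pi}\, n} .
\]

Proof of Lemma C.3. As for the previous lemma, first notice that

\[
\mathbb{P}\left( \widehat{f}_k(X_0) \neq Y_0,\ \widehat{f}_k(e, X_1) \neq Y_1 \mid A^{\ell}_{e,e',1,2} \right) = \mathbb{P}\left( \widehat{f}_k^{(-2)}(X_2) \neq Y_2,\ \widehat{f}_k^{\,e,0}(X_1) \neq Y_1 \mid A^{\ell}_{e,e',1,2} \right) ,
\]

where $\widehat{f}_k^{\,e,0}$ is built on sample $e$ with observation $(X_2, Y_2)$ replaced with $(X_0, Y_0)$. Then

\[
\begin{aligned}
& \mathbb{P}\left( \widehat{f}_k(e, X_1) \neq Y_1,\ \widehat{f}_k(e', X_2) \neq Y_2 \mid A^{\ell}_{e,e',1,2} \right) - \mathbb{P}\left( \widehat{f}_k(X_0) \neq Y_0,\ \widehat{f}_k(e, X_1) \neq Y_1 \mid A^{\ell}_{e,e',1,2} \right) \\
&\quad = \mathbb{P}\left( \widehat{f}_k(e, X_1) \neq Y_1,\ \widehat{f}_k(e', X_2) \neq Y_2 \mid A^{\ell}_{e,e',1,2} \right) - \mathbb{P}\left( \widehat{f}_k^{(-2)}(X_2) \neq Y_2,\ \widehat{f}_k^{\,e,0}(X_1) \neq Y_1 \mid A^{\ell}_{e,e',1,2} \right) \\
&\quad \leq \mathbb{P}\left( \{ \widehat{f}_k(e, X_1) \neq Y_1 \} \,\triangle\, \{ \widehat{f}_k^{\,e,0}(X_1) \neq Y_1 \} \mid A^{\ell}_{e,e',1,2} \right)
+ \mathbb{P}\left( \{ \widehat{f}_k^{(-2)}(X_2) \neq Y_2 \} \,\triangle\, \{ \widehat{f}_k(e', X_2) \neq Y_2 \} \mid A^{\ell}_{e,e',1,2} \right) \\
&\quad = \mathbb{P}\left( \widehat{f}_k(e, X_1) \neq \widehat{f}_k^{\,e,0}(X_1) \mid A^{\ell}_{e,e',1,2} \right) + \mathbb{P}\left( \widehat{f}_k^{(-2)}(X_2) \neq \widehat{f}_k(e', X_2) \mid A^{\ell}_{e,e',1,2} \right) \\
&\quad \leq \frac{8\sqrt{k}}{\sqrt{2\pi}\,(n - p)} + \frac{4p\sqrt{k}}{\sqrt{2\pi}\, n} .
\end{aligned}
\]

□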
D Technical results


D.1 Main inequalities

D.1.1 Hoeffding's lemma for finite populations

Lemma D.1. Let $X_1, \ldots, X_n$ denote a random sample without replacement in a finite
population of $N$ elements with values $c_1, \ldots, c_N$. If $a \leq c_i \leq b$ for $i = 1, \ldots, N$, then

\[
\mathbb{P}\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \right| > t \right) \leq 2 \exp\left( - \frac{2 n t^2}{(b - a)^2} \right) ,
\]

where $\mu = \frac{1}{N} \sum_{i=1}^{N} c_i$.

This lemma is proved in [28].
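On a small population the tail probability in Lemma D.1 can be computed exactly by enumerating every sample drawn without replacement, which gives a concrete feel for how conservative the bound is. This is an illustrative sketch only; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
import math


def exact_two_sided_tail(values, n, t):
    """Exact P(|sample mean - mu| > t) for samples of size n without replacement."""
    N = len(values)
    mu = Fraction(sum(values), N)
    samples = list(combinations(range(N), n))
    hits = sum(
        1 for s in samples
        if abs(Fraction(sum(values[i] for i in s), n) - mu) > t
    )
    return Fraction(hits, len(samples))


def hoeffding_bound(n, t, a, b):
    """Right-hand side of Lemma D.1."""
    return 2 * math.exp(-2 * n * float(t) ** 2 / (b - a) ** 2)
```

For a 0/1 population one takes `a = 0` and `b = 1`; the exact tail returned by the first function never exceeds the value returned by the second.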


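D.1.2 From moment to exponential inequalities

Proposition D.1 (see also [3], Lemma 8.10). Let $X$ denote a real valued random variable,
and assume there exist $C > 0$, $\lambda_1, \ldots, \lambda_N > 0$, and $\alpha_1, \ldots, \alpha_N > 0$ ($N \in \mathbb{N}^*$) such that
for every $q \geq q_0$,

\[
\mathbb{E}\left[ |X|^q \right] \leq C \left( \sum_{i=1}^{N} \lambda_i q^{\alpha_i} \right)^q .
\]

Then for every $t > 0$,

\[
\mathbb{P}\left[ |X| > t \right] \leq C e^{q_0 \min_j \alpha_j} \exp\left( - (\min_i \alpha_i)\, e^{-1} \min_j \left\{ \left( \frac{t}{N \lambda_j} \right)^{1/\alpha_j} \right\} \right) . \tag{D.1}
\]

Furthermore for every $x > 0$, it results

\[
\mathbb{P}\left[ |X| > \sum_{i=1}^{N} \lambda_i \left( \frac{e x}{\min_j \alpha_j} \right)^{\alpha_i} \right] \leq C e^{q_0 \min_j \alpha_j} \cdot e^{-x} . \tag{D.2}
\]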

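Proof of Proposition D.1. By use of Markov's inequality applied to $|X|^q$ ($q > 0$), it comes
for every $t > 0$

\[
\mathbb{P}\left[ |X| > t \right] \leq \mathbb{1}_{q \geq q_0} \frac{\mathbb{E}\left[ |X|^q \right]}{t^q} + \mathbb{1}_{q < q_0}
\leq \mathbb{1}_{q \geq q_0}\, C \left( \frac{\sum_{i=1}^{N} \lambda_i q^{\alpha_i}}{t} \right)^q + \mathbb{1}_{q < q_0} .
\]

Now using the upper bound $\sum_{i=1}^{N} \lambda_i q^{\alpha_i} \leq N \max_j \{ \lambda_j q^{\alpha_j} \}$ and choosing the particular
value $\tilde{q} = \tilde{q}(t) = e^{-1} \min_j \left\{ \left( \frac{t}{N \lambda_j} \right)^{1/\alpha_j} \right\}$, one gets

\[
\mathbb{P}\left[ |X| > t \right] \leq \mathbb{1}_{\tilde{q} \geq q_0}\, C \left( \max_j \left\{ \frac{N \lambda_j}{t} \left( e^{-1} \min_i \left( \frac{t}{N \lambda_i} \right)^{1/\alpha_i} \right)^{\alpha_j} \right\} \right)^{\tilde{q}} + \mathbb{1}_{\tilde{q} < q_0}
\leq C e^{q_0 \min_j \alpha_j} \exp\left( - (\min_i \alpha_i)\, e^{-1} \min_j \left\{ \left( \frac{t}{N \lambda_j} \right)^{1/\alpha_j} \right\} \right) ,
\]

which provides (D.1).

Let us now turn to the proof of (D.2). From $t^* = \sum_{i=1}^{N} \lambda_i \left( \frac{e x}{\min_j \alpha_j} \right)^{\alpha_i}$ combined with
$q^* = \frac{x}{\min_j \alpha_j}$, it arises for every $x > 0$

\[
\frac{\sum_{i=1}^{N} \lambda_i (q^*)^{\alpha_i}}{t^*}
= \frac{\sum_{i=1}^{N} \lambda_i \left( e^{-1} \frac{e x}{\min_j \alpha_j} \right)^{\alpha_i}}{\sum_{i=1}^{N} \lambda_i \left( \frac{e x}{\min_j \alpha_j} \right)^{\alpha_i}}
\leq \max_i e^{-\alpha_i} = e^{-\min_i \alpha_i} .
\]

Then,

\[
C \left( \frac{\sum_{i=1}^{N} \lambda_i (q^*)^{\alpha_i}}{t^*} \right)^{q^*} \leq C e^{-(\min_i \alpha_i) \frac{x}{\min_j \alpha_j}} = C e^{-x} .
\]

Hence,

\[
\mathbb{P}\left[ |X| > \sum_{i=1}^{N} \lambda_i \left( \frac{e x}{\min_j \alpha_j} \right)^{\alpha_i} \right]
\leq C e^{-x}\, \mathbb{1}_{q^* \geq q_0} + \mathbb{1}_{q^* < q_0}
\leq C e^{q_0 \min_j \alpha_j} \cdot e^{-x} ,
\]

since $e^{q_0 \min_j \alpha_j} \geq 1$ and $-x + q_0 \min_j \alpha_j \geq 0$ if $q^* < q_0$.

□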
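To make the two statements of Proposition D.1 easier to manipulate, here is a minimal sketch that evaluates the right-hand side of (D.1) and the threshold appearing in (D.2). The function names are hypothetical and the inputs are assumed to satisfy the hypotheses of the proposition.

```python
import math


def tail_bound_d1(t, C, lambdas, alphas, q0):
    """Right-hand side of (D.1) for a deviation level t > 0."""
    N = len(lambdas)
    a_min = min(alphas)
    # The moment order selected in the proof: e^{-1} min_j (t / (N lambda_j))^{1 / alpha_j}.
    q_t = min((t / (N * lam)) ** (1.0 / a) for lam, a in zip(lambdas, alphas)) / math.e
    return C * math.exp(q0 * a_min) * math.exp(-a_min * q_t)


def threshold_d2(x, lambdas, alphas):
    """Deviation threshold sum_i lambda_i (e x / min_j alpha_j)^{alpha_i} of (D.2)."""
    a_min = min(alphas)
    return sum(lam * (math.e * x / a_min) ** a for lam, a in zip(lambdas, alphas))
```

When $N = 1$ the two statements coincide: plugging `threshold_d2(x, ...)` into `tail_bound_d1` returns $C e^{q_0 \alpha_1} e^{-x}$, which is the right-hand side of (D.2). For $N = 2$, $\alpha = (1/2, 3/2)$ and $q_0 = 2$, this is the configuration used in the proofs of Theorem 4.2 and Proposition 4.2.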


D.1.3 Sub-Gaussian random variables

Lemma D.2 (Theorem 2.1 in [7] first part). Any centered random variable $X$ such that
$\mathbb{P}(X > t) \vee \mathbb{P}(-X > t) \leq e^{-t^2/(2v)}$ satisfies

\[
\mathbb{E}\left[ X^{2q} \right] \leq q!\, (4v)^q
\]

for all $q$ in $\mathbb{N}_+$.

Lemma D.3 (Theorem 2.1 in [7] second part). Any centered random variable $X$ such that

\[
\mathbb{E}\left[ X^{2q} \right] \leq q!\, C^q
\]

for some $C > 0$ and $q$ in $\mathbb{N}_+$ satisfies $\mathbb{P}(X > t) \vee \mathbb{P}(-X > t) \leq e^{-t^2/(2v)}$ with $v = 4C$.


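D.1.4 The Efron-Stein inequality


Theorem D.1 (Efron-Stein's inequality [7], Theorem 3.1). Let $X_1, \ldots, X_n$ be independent
random variables and let $Z = f(X_1, \ldots, X_n)$ be a square-integrable function. Then

\[
\operatorname{Var}(Z) \leq \sum_{i=1}^{n} \mathbb{E}\left[ \left( Z - \mathbb{E}\left[ Z \mid (X_j)_{j \neq i} \right] \right)^2 \right] = \nu .
\]

Moreover if $X'_1, \ldots, X'_n$ denote independent copies of $X_1, \ldots, X_n$ and if we define for every
$1 \leq i \leq n$

\[
Z'_i = f(X_1, \ldots, X'_i, \ldots, X_n) ,
\]

then

\[
\nu = \frac{1}{2} \sum_{i=1}^{n} \mathbb{E}\left[ \left( Z - Z'_i \right)^2 \right] .
\]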


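D.1.5 Generalized Efron-Stein's inequality


Theorem D.2 (Theorem 15.5 in [7]). Let $X_1, \ldots, X_n$ be $n$ independent random variables, $f :
\mathbb{R}^n \to \mathbb{R}$ a measurable function, and define $Z = f(X_1, \ldots, X_n)$ and $Z'_i = f(X_1, \ldots, X'_i, \ldots, X_n)$,
with $X'_1, \ldots, X'_n$ independent copies of the $X_i$. Furthermore let

\[
V_+ = \mathbb{E}\left[ \sum_{i=1}^{n} \left( (Z - Z'_i)_+ \right)^2 \,\Big|\, X_1^n \right] \quad \text{and} \quad V_- = \mathbb{E}\left[ \sum_{i=1}^{n} \left( (Z - Z'_i)_- \right)^2 \,\Big|\, X_1^n \right] .
\]

Then there exists a constant $\kappa \leq 1.271$ such that
for all $q$ in $[2, +\infty[$,

\[
\left\| (Z - \mathbb{E}Z)_+ \right\|_q \leq \sqrt{2\kappa q\, \| V_+ \|_{q/2}} , \qquad
\left\| (Z - \mathbb{E}Z)_- \right\|_q \leq \sqrt{2\kappa q\, \| V_- \|_{q/2}} .
\]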
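Corollary D.1. With the same notation, it comes

\[
\left\| Z - \mathbb{E}Z \right\|_q \leq \sqrt{2\kappa q} \sqrt{ \left\| \sum_{i=1}^{n} \mathbb{E}\left[ (Z - Z'_i)^2 \mid X_1^n \right] \right\|_{q/2} } \tag{D.3}
\]

\[
\phantom{\left\| Z - \mathbb{E}Z \right\|_q} \leq 2\sqrt{\kappa q} \sqrt{ \left\| \sum_{i=1}^{n} \left( Z - \mathbb{E}\left[ Z \mid (X_j)_{j \neq i} \right] \right)^2 \right\|_{q/2} } . \tag{D.4}
\]

Moreover considering $Z^{(j)} = f(X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_n)$ for every $1 \leq j \leq n$, it results

\[
\left\| Z - \mathbb{E}Z \right\|_q \leq 2\sqrt{2\kappa q} \sqrt{ \left\| \sum_{j=1}^{n} \left( Z - Z^{(j)} \right)^2 \right\|_{q/2} } . \tag{D.5}
\]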


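Proof of Corollary D.1.

First note that

\[
\left\| (Z - \mathbb{E}Z)_+ \right\|_q^q + \left\| (Z - \mathbb{E}Z)_- \right\|_q^q = \left\| Z - \mathbb{E}Z \right\|_q^q .
\]

Consequently,

\[
\left\| Z - \mathbb{E}Z \right\|_q^q \leq \sqrt{2\kappa q}^{\,q} \left( \sqrt{\| V_+ \|_{q/2}}^{\,q} + \sqrt{\| V_- \|_{q/2}}^{\,q} \right)
\leq \sqrt{2\kappa q}^{\,q} \left\| \sum_{i=1}^{n} \mathbb{E}\left[ (Z - Z'_i)^2 \mid X_1^n \right] \right\|_{q/2}^{q/2} .
\]

Besides,

\[
\begin{aligned}
\mathbb{E}\left[ (Z - Z'_i)^2 \mid X_1^n \right]
&= \mathbb{E}\left[ \left( Z - \mathbb{E}\left[ Z \mid (X_j)_{j \neq i} \right] + \mathbb{E}\left[ Z \mid (X_j)_{j \neq i} \right] - Z'_i \right)^2 \mid X_1^n \right] \\
&= \mathbb{E}\left[ \left( Z - \mathbb{E}\left[ Z \mid (X_j)_{j \neq i} \right] \right)^2 + \left( \mathbb{E}\left[ Z \mid (X_j)_{j \neq i} \right] - Z'_i \right)^2 \mid X_1^n \right] \\
&= \mathbb{E}\left[ \left( Z - \mathbb{E}\left[ Z \mid (X_j)_{j \neq i} \right] \right)^2 \mid X_1^n \right] + \mathbb{E}\left[ \left( \mathbb{E}\left[ Z'_i \mid (X_j)_{j \neq i} \right] - Z'_i \right)^2 \mid X_1^n \right] .
\end{aligned}
\]

Combining the two previous results leads to

\[
\begin{aligned}
\left\| Z - \mathbb{E}Z \right\|_q
&\leq \sqrt{2\kappa q} \sqrt{ \left\| \sum_{i=1}^{n} \left( Z - \mathbb{E}\left[ Z \mid (X_j)_{j \neq i} \right] \right)^2 \right\|_{q/2} + \left\| \sum_{i=1}^{n} \mathbb{E}\left[ \left( \mathbb{E}\left[ Z'_i \mid (X_j)_{j \neq i} \right] - Z'_i \right)^2 \mid X_1^n \right] \right\|_{q/2} } \\
&\leq 2\sqrt{\kappa q} \sqrt{ \left\| \sum_{i=1}^{n} \left( Z - \mathbb{E}\left[ Z \mid (X_j)_{j \neq i} \right] \right)^2 \right\|_{q/2} } .
\end{aligned}
\]

□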
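D.1.6 McDiarmid's inequality

Theorem D.3. Let $X_1, \ldots, X_n$ be independent random variables taking values in a set $A$,
and assume that $f : A^n \to \mathbb{R}$ satisfies

\[
\sup_{x_1, \ldots, x_n,\, x'_i} \left| f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x'_i, \ldots, x_n) \right| \leq c_i , \quad 1 \leq i \leq n .
\]

Then for all $\varepsilon > 0$, one has

\[
\begin{aligned}
\mathbb{P}\left( f(X_1, \ldots, X_n) - \mathbb{E}\left[ f(X_1, \ldots, X_n) \right] \geq \varepsilon \right) &\leq e^{-2\varepsilon^2 / \sum_{i=1}^{n} c_i^2} , \\
\mathbb{P}\left( \mathbb{E}\left[ f(X_1, \ldots, X_n) \right] - f(X_1, \ldots, X_n) \geq \varepsilon \right) &\leq e^{-2\varepsilon^2 / \sum_{i=1}^{n} c_i^2} .
\end{aligned}
\]

A proof can be found in [19] (see Theorem 9.2).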


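D.1.7 Rosenthal's inequality

Proposition D.2 (Eq. (20) in [29]). Let $X_1, \ldots, X_n$ denote independent real random
variables with symmetric distributions. Then for every $q > 2$ and $\gamma > 0$,

\[
\mathbb{E}\left[ \left| \sum_{i=1}^{n} X_i \right|^q \right] \leq B(q, \gamma) \left\{ \gamma \sum_{i=1}^{n} \mathbb{E}\left[ |X_i|^q \right] \vee \left( \sqrt{ \sum_{i=1}^{n} \mathbb{E}\left[ X_i^2 \right] } \right)^q \right\} ,
\]

where $a \vee b = \max(a, b)$ ($a, b \in \mathbb{R}$), and $B(q, \gamma)$ denotes a positive constant only depending
on $q$ and $\gamma$. Furthermore, the optimal value of $B(q, \gamma)$ is given by

\[
B^*(q, \gamma) =
\begin{cases}
1 + \dfrac{\mathbb{E}\left[ |N|^q \right]}{\gamma} , & \text{if } 2 < q \leq 4 , \\[2mm]
\gamma^{-q/(q-1)}\, \mathbb{E}\left[ |Z - Z'|^q \right] , & \text{if } 4 < q ,
\end{cases}
\]

where $N$ denotes a standard Gaussian variable, and $Z, Z'$ are i.i.d. random variables with
Poisson distribution $\mathcal{P}\left( \frac{\gamma^{1/(q-1)}}{2} \right)$.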


Proposition D.3. Let Xi,..., Xn denote independent real random variables with sym¬ 
metric distributions. Then for every q > 2, 


Proof of Proposition D.3. From Lemma D.4, let us observe 


E 


2=1 


< 2V2e 


E 

2=1 




2=1 
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• if 2 < g < 4, 




by choosing 7 = 1 . 
if 4 < g, 


B*{q,'y) < q Aeq + 9 )^ ^ O' , 


with 7 = q^^ ^)l'^, 

Plugging the previous upper bounds in Rosenthal’s inequality (Proposition D.2), it results 
for every q > 2 


^x. 

2=1 


< 2V2e^ < 


(v^r^EiiXiriv 


2 = 1 




which leads to the conclusion. 


□ 


Lemma D.4. With the same notation as Proposition D.2 and for every 7 > 0, it comes 
• for every 2 < q < A, 


B* 


7 


for every A < q, 


B*{q,'y) <7 Aeq ( 7 ^/+ g) ^ 


Proof of Lemma D.4. If 2 < g < 4, 

7 7 7 7 

by use of Lemma D.IO and ^ fo^' every q> 2. 
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If g > 4, 

=7-9/(9-i)£; [Iz-Z'l*^] 

< 7 - 9 /( 9 - 1 ) 29 / 2 + 16^9 [- +9 

< 7-9/(9-1)29/2^9^, J I J'^l/(,_1) ^ 


g/2 

19/2 


< ^-9/(9-l) 


4eg (7 


,4/(-?-i) 


+ 9 


<?/2 


= ^-9/(9-l) 


applying Lemma D.12 with A = 1 / 271/(9 i). 


Y^ 4 eg (71/(9 1) + gr) 


□ 


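D.2 Technical lemmas

D.2.1 Basic computations for resampling applied to the kNN algorithm
Lemma D.5. For every $1 \leq i \leq n$ and $1 \leq p \leq n$, one has

\[
\mathbb{P}_e\left( i \notin e \right) = \frac{p}{n} , \tag{D.6}
\]

\[
\sum_{j=1}^{n} \mathbb{P}_e\left[ i \notin e,\ j \in V_k^{e}(X_i) \right] = \frac{kp}{n} . \tag{D.7}
\]

In the same way,

\[
\sum_{k < \sigma_i(j) \leq k+p} \mathbb{P}_e\left[ i \notin e,\ j \in V_k^{e}(X_i) \right] = \frac{kp}{n} \frac{p-1}{n-1} . \tag{D.8}
\]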


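Proof of Lemma D.5. The first equality is straightforward. The second one results from
simple calculations as follows.

\[
\sum_{j=1}^{n} \mathbb{P}_e\left[ i \notin e,\ j \in V_k^{e}(X_i) \right]
= \sum_{j=1}^{n} \binom{n}{p}^{-1} \sum_{e} \mathbb{1}_{i \notin e}\, \mathbb{1}_{j \in V_k^{e}(X_i)}
= \binom{n}{p}^{-1} \sum_{e} \mathbb{1}_{i \notin e} \left( \sum_{j=1}^{n} \mathbb{1}_{j \in V_k^{e}(X_i)} \right)
= \binom{n}{p}^{-1} \sum_{e} \mathbb{1}_{i \notin e}\, k = \frac{kp}{n} .
\]

For the last equality, let us notice every $j \in V_k(X_i)$ satisfies

\[
\mathbb{P}_e\left[ i \notin e,\ j \in V_k^{e}(X_i) \right] = \mathbb{P}_e\left[ j \in V_k^{e}(X_i) \mid i \notin e \right] \mathbb{P}_e\left[ i \notin e \right] = \frac{n-p}{n-1} \frac{p}{n} ,
\]

hence

\[
\sum_{k < \sigma_i(j) \leq k+p} \mathbb{P}_e\left[ i \notin e,\ j \in V_k^{e}(X_i) \right]
= \sum_{j=1}^{n} \mathbb{P}_e\left[ i \notin e,\ j \in V_k^{e}(X_i) \right] - \sum_{\sigma_i(j) \leq k} \mathbb{P}_e\left[ i \notin e,\ j \in V_k^{e}(X_i) \right]
= k \frac{p}{n} - k \frac{n-p}{n-1} \frac{p}{n} = k \frac{p}{n} \frac{p-1}{n-1} .
\]

□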
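The three identities of Lemma D.5 can be verified exactly on a toy sample by enumerating all training sets $e$ of size $n - p$. In the sketch below, $\sigma_i(j)$ is taken to be the rank of $X_j$ among the neighbors of $X_i$ in the whole sample (an assumption consistent with the proof), the points are one-dimensional with pairwise distinct distances, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def lemma_d5_quantities(xs, p, k, i):
    """Return (P(i not in e), sum over all j, sum over ranks > k) as exact fractions."""
    n = len(xs)

    def dist(j):
        return abs(xs[j] - xs[i])

    # Rank sigma_i(j) of each j as a neighbor of i in the whole sample.
    ranked = sorted((j for j in range(n) if j != i), key=dist)
    rank = {j: r + 1 for r, j in enumerate(ranked)}
    train_sets = list(combinations(range(n), n - p))
    weight = Fraction(1, len(train_sets))
    prob_out = Fraction(0)
    prob_neighbour = {j: Fraction(0) for j in ranked}
    for e in train_sets:
        if i in e:
            continue
        prob_out += weight
        # k nearest neighbors of X_i among the training points indexed by e.
        for j in sorted(e, key=dist)[:k]:
            prob_neighbour[j] += weight
    total = sum(prob_neighbour.values())
    tail = sum(pr for j, pr in prob_neighbour.items() if rank[j] > k)
    return prob_out, total, tail
```

With `xs = [1, 2, 4, 8, 16, 32]`, `p = 2`, `k = 2` and `i = 3`, the function returns $1/3$, $2/3$ and $2/15$, that is $p/n$, $kp/n$ and $\frac{kp}{n}\frac{p-1}{n-1}$ for $n = 6$.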


D.2.2 Stone's lemma

Lemma D.6. Given $n$ points $(x_1, \ldots, x_n)$ in $\mathbb{R}^d$, any of these points belongs to the $k$ nearest
neighbors of at most $k\gamma_d$ of the other points, where $\gamma_d$ only depends on $d$.

A proof of this lemma can be found in [19] (see Corollary 11.1).
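In dimension one the statement of Lemma D.6 is easy to probe numerically. The sketch below assumes the value $\gamma_1 = 2$, which follows from an ordering argument on the real line (a point can only be among the $k$ nearest neighbors of its $k$ closest points on each side); this value is an assumption of the example, not a constant taken from [19], and the function name is hypothetical.

```python
import random


def max_neighbour_count(xs, k):
    """Largest number of points having a given point among their k nearest neighbors."""
    n = len(xs)
    counts = [0] * n
    for i in range(n):
        # Indices of the other points sorted by distance to xs[i].
        order = sorted((j for j in range(n) if j != i), key=lambda j: abs(xs[j] - xs[i]))
        for j in order[:k]:
            counts[j] += 1
    return max(counts)
```

For pairwise distinct real numbers `xs`, the returned value never exceeds `2 * k`.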

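D.2.3 Stability of the kNN classifier when removing p observations

Lemma D.7. Let $\widehat{f}_k$ and $\widehat{g}_k$ denote $k$NN classifiers built respectively from $(X_1, Y_1), \ldots, (X_n, Y_n)$
and $(X_1, Y_1), \ldots, (X_{n-p}, Y_{n-p})$, for $1 \leq p \leq n - 1$. Then for a new random variable $(X, Y)$
with the same distribution as $(X_1, Y_1)$, it comes

\[
\mathbb{P}\left( \widehat{f}_k(X) \neq \widehat{g}_k(X) \right) \leq \frac{4}{\sqrt{2\pi}} \frac{p\sqrt{k}}{n} .
\]

This lemma is proved in [21, Formula 14].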


D.2.4 Upper bound for the L1O estimator
Lemma D.8. One has




> e < 2 exp 


—ne 


where $\widehat{f}_k = \widehat{f}_k(Z_{1,n-1}; \cdot)$ is the $k$NN classifier built from a sample of cardinality $n - 1$.
This lemma corresponds to Theorem 24.4 in [19] where the proof can be found.

D.2.5 Moment upper bounds for the L1O estimator
Lemma D.9.


E 


<q\[2 


,(^7d) 


2 \ -? 


m 


The proof is straightforward from the combination of Lemmas D.2 and D.8. 


(D.9) 


D.2.6 Upper bound on the optimal constant in Rosenthal's inequality


Lemma D.10. Let $N$ denote a real valued standard Gaussian random variable. Then for
every $q > 2$, one has


EilTVl"] < V2e^ 


Proof of Lemma D.IO. If q is even {q = 2k > 2), then 


q\2 


r+oo 1 2 

E[|iV|'']=2/ = 

Jo 


2 {q-iy. 




+CXD 2 

2 " dx 


vr 2^ i(A: —1)! V vr 2'j/2(q/2)! 
Then using for any positive integer a 

Vrira < a\ < \/2e7ra , 

it results 


29/2(g/2)! 


< 




which implies 


If q is odd (<7 = 2A; + 1 > 2), then 

E [ W1 = yif f 

by setting x = In particular, this implies 

(2*)‘ e-‘dt = , < V2eV2 {f) * ■ 

□ 
Lemma D.11. Let $S$ denote a binomial random variable such that $S \sim \mathcal{B}(k, 1/2)$ ($k \in \mathbb{N}^*$).
Then for every $q > 3$, it comes

n\s-ns]V]<^v~eVd 



Proof of Lemma D.ll. Since S — E(S') is symmetric, it comes 


p+OO p I p-\-OQ 

E[|5-E[5]|^] = 2 / P 5<E[5]-ti/'? dt = 2q P [5 < E [5] - w] du. 
/o'- J /o 

Using Chernoff’s inequality and setting u = ^/kj2v, it results 

/ +00 2 r+oo 2 

u'^~^e~~i^du = 2qy - J v'^~^e~^dv. 

If q is even, then q — 1 > 2 is odd and the same calculations as in the proof of 
Lemma D.IO apply, which leads to 




E||S-E[S]n<2y| 2"/2(|)!<2y| 


2e/ 


2e 



If q is odd, then g — 1 > 2 is even and another use of the calculations in the proof of 
Lemma D.IO provides 


E[|,S-E[S]|''] < 2q 


r (g-1)! 

2 2 U - i )/2 2 ^! "V 2 2 ('?- i )/2 2 ^! 


= 2 


Let us notice 


< 


V2ire4 (f)’ 






q /g\('?+i)/2 / q 


q-l 


= VYe 


(g-i)/2 


q\q 


(!) 


q-l 


and also that 


This implies 


^ ^ <v^. 


q-l \q-l 

(-?+l)/2 


2('?-i)/2 


i)/22zTi - \eJ ^ 


q\'?/2 


hence 


E[|5-E[5]|^] 




□ 


Lemma D.12. Let $X, Y$ be two i.i.d. random variables with Poisson distribution $\mathcal{P}(\lambda)$
($\lambda > 0$). Then for every $q > 3$, it comes


E[|X ^(2A + g) 


q/2 


Proof of Lemma D.12. Let us first remark that 


E[|X-yn =En[E[\X I N]] =2‘'EAr[E[|y-y/2|‘' \N]], 


where $N = X + Y$. Furthermore, the conditional distribution of $X$ given $N = X + Y$ is a
binomial distribution $\mathcal{B}(N, 1/2)$. Then Lemma D.11 provides that


E[|X 


N/2\^ I A^] < 




a.s., 


which entails that 


E[|X - y|''] < 2‘'EAr 





jyiD 


It only remains to upper bound the last expectation, where $N$ is a Poisson random variable
$\mathcal{P}(2\lambda)$ (since $X, Y$ are i.i.d.):


Ejv 




< -v/Etv [Ni ] 


by Jensen’s inequality. Further introducing Touchard polynomials and using a classical 
upper bound, it comes 


Eat 




i=0 ^ 2 





/1(2A + 9)’ 



= 2 

^ (2A + . 
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Finally, one concludes 


E[|X - 


V\^ ] < 2''/2+2^^ / S. 2^ (2A + 9)"/^ < 2'?/2+ie^ ^ (2A + q) 


q 

L e 


19/2 


□ 
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